Visual Querying and Analysis of Large Software Repositories


Empirical Software Engineering manuscript

Lucian Voinea (1), Alexandru Telea (2)
(1) Technische Universiteit Eindhoven, The Netherlands, l.voinea@tue.nl
(2) Technische Universiteit Eindhoven, The Netherlands, alext@win.tue.nl

Abstract

We present a software framework for visual mining of software repositories. Our framework addresses three aspects: data extraction from repositories, data analysis, and interactive visualization and exploration. Along these, it provides an open structure, extensible with new functionality. We first discuss the challenges of data extraction and storage, and propose a flexible way to deal with implementation inconsistencies in software repositories. We next present a new technique to enrich the raw data with information about artifacts showing similar evolution. Finally, we propose a visualization back-end that supports visual navigation and query of the extracted data at several levels of detail. We demonstrate the applicability of our framework by presenting several case studies performed on industry-size software repositories.

1 Introduction

Software Configuration Management (SCM) systems are widely accepted instruments for managing large software development projects containing millions of lines of code spanning thousands of files developed by hundreds of people over many years. SCMs maintain a history of changes in both the structure and contents of the managed project. This information is well suited for empirical studies on software evolution [1,2,15]. Many SCM systems exist on the market and are used in daily practice in the software industry. Among the most widespread SCM systems we mention CVS, Subversion, Visual SourceSafe, RCS, CM Synergy, and ClearCase.
The Concurrent Versions System (CVS) and Subversion, both freely available as Open Source software, are very popular SCM systems in the Open Source community. Many CVS and Subversion repositories covering long evolution periods, e.g., 5-10 years, are freely available for analysis. These repositories are, hence, interesting options for research on software evolution.

However, most repositories and SCM systems are mainly designed for archiving data. For example, CVS and its web-based front-ends offer only a basic querying interface for retrieving a given version of a file or an attribute list with the file state evolution. CVS provides no functionality that enables users to get data overviews easily. Such overviews are essential when the questions one asks involve many files or file versions, rather than a particular file. Most questions concerning spotting trends in software evolution are of this type. Moreover, the feedback provided by CVS, e.g., a queried list of state attributes, is delivered only in a human-readable textual format, which makes it impractical for quick browsing and automatic parsing. This makes the preprocessing needed for answering more complex questions difficult.

Subversion is an SCM system built on the same concepts as CVS, but it tries to address some of its shortcomings. Most notably, the Subversion communication protocol is entirely machine parsable, which simplifies the task of data extraction from repositories. However, most SCM systems, including CVS and Subversion, are primarily designed to support the task of archiving software. The main operations they provide are check-in, check-out, and a limited amount of functionality for navigating the intermediate versions that a software system has during its evolution.

This leads us to a major challenge of software evolution research based on SCMs, namely how to tackle data size and complexity. Raw repository information is too large and, taken directly, provides only limited insight into the evolution of a software project. Extra analysis is needed to process these data and extract relevant evolution features.

In this paper we address the challenges of software evolution assessment in CVS and Subversion repositories. We propose an open framework for data extraction and analysis.
We illustrate the capabilities of this framework with a customized implementation. The basic questions we try to answer using this framework are:

- How to deal with the large size of repository data and various limitations of textual feedback?
- How to extract logical coupling information from evolution?
- How to efficiently present evolution data to users to enable correlations across entire projects?
- How to gain insight into a system's evolution at different levels of detail of the software?

The structure of this paper is as follows. In Section 2 we review existing repository data extraction methods and software evolution analysis techniques. Section 3 presents our flexible interface with software repositories. Section 4 describes a new clustering technique for detecting logical coupling of files based on evolution similarity. Section 5 describes a visual back-end for evolution assessment and demonstrates our entire framework in five experiments performed on four different large-scale, real-world repositories. Section 6 summarizes our contribution and outlines open issues for future research.

2 Background

The huge potential of the data stored in SCMs for empirical studies on software evolution has been recently acknowledged. The growth in popularity and use of SCM systems, e.g., the open source CVS [7] and Subversion [20], opened new ways for project accounting, auditing and understanding. So far, the research efforts to support these developments can be grouped in two directions: data mining and data visualization.

Data mining focuses on extracting and processing relevant information from SCM systems. SCM systems have not been designed to support empirical studies, so they often lack direct access to high-level, aggregated evolution information. Hence, information is distilled from the raw stored data by data mining tools. In the following, we review a number of the prominent results in this field. Fischer et al. [10] extend the SCM evolution data with information on file merge points. Gall [12] and German [14] use transaction recovery methods based on fixed time windows. Zimmermann et al. [29] extended this work with sliding windows and facts mined from commits. Ball analyzes cohesion of classes using a mined probability of classes being modified together [1]. Bieman et al. [3] and Gall et al. [12] also mine relations between classes based on change similarities. Ying et al. [27] and Zimmermann et al. [29, 28] address relations between finer-grained artifacts, e.g., functions. Lopez-Fernandez et al. [17] apply general social network analysis methods on SCM data to assess the similarity and development process of large projects.
All in all, the above methods provide several types of distilled information from the raw facts, thereby trying to answer different questions, such as: which software entities evolved together, which are the high and low activity areas of a project, how is the developer network structured, and how do a number of quality metrics, such as maintainability or modularity, change during a project's evolution. In the data mining step, an information size reduction often takes place, as only the metrics and facts relevant to the actual question are examined further. However useful, this approach can discard important facts in the data which are not directly reflected by the mining process, but which can give useful collateral insight.

Data visualization, the second research direction, takes a different path, focusing on making the large amount of evolution information effectively available to users. Visualization methods make few assumptions about the data: the goal here is to let users discover patterns and trends rather than coding these in the mining process. SeeSoft [8], one of the earliest works in this direction, is a line-based code visualization tool which uses color to show code snippets matching given modification requests. Augur [11] visually combines project artifact and activity data at a given moment. Xia [26] uses treemap layouts for software structure, colored to show evolution metrics, e.g., time and author of last commit and number of changes. Such tools successfully show the structure of software systems and the change dependencies at given moments. Yet, they don't give insight into code attributes and structure changes made throughout an entire project. As a first step toward insight into several versions, UNIX's gdiff and Windows' WinDiff tools display code differences between two versions of a file by showing line

insertions, deletions, and edits computed by the diff tool. Still, such tools cannot show the evolution of thousands of files and hundreds of versions. To overcome these shortcomings, Collberg et al. [6] depict the evolution of software structures and mechanisms as a sequence of graphs, for medium-size projects. Lanza [16] depicts the evolution of object-oriented software systems at class level. Wu et al. [25] visualize the evolution of entire projects at file level and emphasize the evolution moments. Burch et al. [5] propose a framework for visual mining of both evolution and structure at various levels of detail. SoftChange [13] is a first attempt in this direction at a coherent environment to support the comparison of Open Source projects, targeting CVS, project mailing lists, and bug report databases. SoftChange focuses mainly on data extraction and analysis, aiming to be a generic foundation for building evolution visualization tools. Finally, our own work provided software evolution visualizations at several granularity levels: CVSscan [24] for the line-level evolution of a few source code files and CVSgrab [21] for file-level, project-wide evolution investigations.

Data extraction is a less discussed, yet very important, aspect of software evolution analysis. Subversion offers a standard API for accessing the software repository. However, Subversion is a relatively new SCM system. It appeared in June 2000 and it was first managed using CVS. This is reflected in the lower number of large repositories freely available for investigation. A large part of the open source community still uses CVS as the primary SCM system. Therefore, a large part of research in software evolution considers data from CVS repositories, e.g., [29,10,12,13,27,17,5,24,21]. Yet, a standard framework for CVS data extraction is still lacking. Two main challenges exist here for the research community: evolution data retrieval and CVS output parsing.
The huge amount of data in software repositories is usually available over the Internet. On-the-fly retrieval is not suited for interactive assessment, given the sheer data size. Storing data locally requires long acquisition times, large storage space, and consistency checks. Next, CVS output is badly suited for machine reading. Many CVS systems use ambiguous or nonstandard output formats. Attempts to address these problems exist, but are incomplete. Libraries exist that offer an application programming interface (API) to CVS, e.g., Java's javacvs or Perl's libcvs. However, javacvs is basically undocumented. Libcvs handles only local repositories. The Eclipse environment offers a CVS client implementation but not an API. The Bonsai project [4] offers several tools to populate a database with evolution data obtained from CVS repositories. These tools are mainly meant as a web data access package and are sparsely documented. The best supported effort for CVS data acquisition is the NetBeans javacvs package [18], a well-documented API with allegedly full CVS client support that parses CVS output into API-level data structures.

Only a few tools combine data extraction, data mining and analysis, and data visualization functionality in a single environment [13, 19]. We propose a new approach towards an integrated framework that addresses these issues. Our goal is twofold. First, we aim to leverage the power of interactive visualization techniques in the analysis of the large amounts of useful and interesting information in software repositories. Second, we aim at developing an experimental foundation, both in terms of software tools and methodology, for practical research on software evolution. Ultimately, we would like to provide end users with a complete software evolution analysis chain. Our approach is described next.

Fig. 1 A data acquisition mediator for generic, fast and transparent access to SCM repositories

3 Data extraction

Repository data extraction is a main practical problem for research on software evolution. Most repository access protocols, such as those of CVS and Subversion, implement only basic access functions to support the archiving process. Additionally, in the concrete case of CVS, navigation commands do not have a machine-readable output. Navigation feedback is given in a compiled text format that is not always easy to decipher. Parsing this output often fails on some repositories due to awkward local conventions, e.g., the date format is yyyy-mm-dd and not dd/mm/yyyy, or file names may contain spaces. This makes uniform access to CVS data difficult. In such cases, one usually searches for a parser that copes with the output format at hand and tries to add it to the experimental setup.

We propose next an approach towards repository data access that simplifies this process using a data acquisition mediator (see Figure 1). The mediator is an easy-to-customize layer between repositories and the data acquisition, processing and analysis tools. When format inconsistencies occur between the CVS output and a parser, we don't need a new CVS data acquisition tool.
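To make the date-format problem concrete, the following is a minimal sketch of a normalizing rule of the kind a mediator could apply. It is not the authors' code: the function names, the list of accepted layouts and the sample line are hypothetical and chosen only for illustration.

```python
import re
from datetime import datetime

# Hypothetical list of date layouts that a repository might emit.
# Order matters: the first layout that parses wins.
KNOWN_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%d/%m/%Y")

# Matches tokens that look like dates, e.g. 2004-03-15 or 15/03/2004.
DATE_TOKEN = re.compile(r"\b\d{1,4}[-/]\d{1,2}[-/]\d{1,4}\b")


def normalize_date(token, target="%Y/%m/%d"):
    """Rewrite one date token in the target layout.

    Tokens that match none of the known layouts are returned unchanged,
    so the rule never destroys data it does not understand.
    """
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(token, fmt).strftime(target)
        except ValueError:
            continue
    return token


def normalize_line(line, target="%Y/%m/%d"):
    """Normalize every date-like token found in one line of tool output."""
    return DATE_TOKEN.sub(lambda m: normalize_date(m.group(0), target), line)
```

A downstream parser that only understands one layout can then be fed `normalize_line(raw_line)` instead of the raw output. Note that a purely syntactic rule cannot tell dd/mm from mm/dd; which layouts to accept is a per-repository decision.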
Instead, we adapt the mediator with a simple rule to transform the new format into

the one accepted by the tool. While this does not completely remove the problems of inconsistent output formatting, it is a flexible way to solve problems without replacing the preferred data acquisition tool.

We developed an open source, easy-to-customize mediator in a simple-to-use programming language: Python. The mediator mainly consists of a set of scripts which act as data filters, reading the raw data produced by the SCM tool (e.g., the CVS client), applying data and format conversion, and outputting a filtered data stream. Filters can be connected in a dataflow-like manner to yield the desired data acquisition tool. Several filter pipelines can be added to work in parallel and have their outputs merged at the end. This enables alternative parsing strategies, e.g., in case one does not know beforehand how to detect the data format used. A second function of the mediator is to offer selective access to CVS repositories by retrieving only information about a desired folder or file, as demanded by the data analysis or visualization tool further in the pipeline. The mediator also caches the retrieved information locally, using a custom-developed database. This design lets one control the trade-off between latency, bandwidth and storage space in the data acquisition step as desired.

4 Mining evolutionary coupling

Raw repository data is too large and low-level to provide insight into the evolution of software projects. Extra analysis is needed to extract relevant evolution aspects. An important analysis use case is to identify artifacts that have a similar evolution. Several approaches exist for this task [3, 12, 27, 29]. All these approaches use similarity measures based on recovered repository transactions, i.e., sets of files committed by a user at some moment. The assumption is that related files have a similar evolution pattern, and thus their revisions will often share the same repository transaction.
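As an illustration of what recovering transactions can mean in practice, here is a minimal sketch of a sliding-time-window grouping. It is only an assumption about how such a recovery might work, not the method of any cited paper: the `Commit` record, the grouping key (author and log message) and the 200-second window are hypothetical choices.

```python
from collections import namedtuple

# One file revision as it might be read from a repository log.
# `time` is a timestamp in seconds; all field names are illustrative.
Commit = namedtuple("Commit", "path author message time")


def recover_transactions(commits, window=200):
    """Group single-file commits into transactions.

    Two commits join the same transaction when they share author and
    log message and the gap to the previous commit of that transaction
    is at most `window` seconds. Because the gap is measured from the
    latest member, the window slides along with the transaction.
    """
    transactions = []
    open_tx = {}  # (author, message) -> most recent transaction for that key
    for c in sorted(commits, key=lambda c: c.time):
        key = (c.author, c.message)
        tx = open_tx.get(key)
        if tx is not None and c.time - tx[-1].time <= window:
            tx.append(c)
        else:
            tx = [c]
            open_tx[key] = tx
            transactions.append(tx)
    return transactions
```

With a fixed window one would instead compare each commit's time against the first member of the transaction (`tx[0].time`).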
This information about correlated files is used to predict future changes in the analyzed system, from the perspective of a given artifact.

We propose a more general approach. We argue that not transactions, but pure commit moments, are important for finding similar files. Transaction-based similarity measures fail to correlate files that are developed by different authors and have different comments attached, but which are still highly coupled. To handle such cases, we propose a similarity measure using the time distance between commit moments. If $S_1 = \{t_i \mid i = 1..N\}$ are the commit moments for a file $F_1$ and $S_2 = \{t_j \mid j = 1..M\}$ the commit moments for $F_2$, we define the similarity between $F_1$ and $F_2$ as the symmetric sum:

$$\Phi(F_1, F_2) = \sum_{i=1}^{N} \frac{1}{\sqrt{\min\{\,|t_i - t_j| : t_j \in S_2,\ |t_i - t_j| < k\,\} + 1}} + \sum_{j=1}^{M} \frac{1}{\sqrt{\min\{\,|t_j - t_i| : t_i \in S_1,\ |t_j - t_i| < k\,\} + 1}} \qquad (1)$$

where $k$ is a customizable neighborhood factor intended to reduce the influence of completely unrelated events on the similarity measure. The square root is meant to attenuate the influence of the network latency on the transaction time recorded by the repository. Intuitively, this measure considers, for each commit moment of

$F_1$, the closest commit moment from $F_2$, weighted by the inverse time distance between the two moments.

We next use this measure in an agglomerative clustering algorithm to group files that have a similar evolution. The clustering works in a bottom-up fashion, as follows. First, all individual files are placed in their own cluster. Next, the two most similar clusters, as given by the above similarity metric, are found and merged into a parent cluster, which gets the two original clusters as children. The process is repeated until a single root cluster is obtained. To measure the similarity of non-leaf clusters, we use Equation 1 on the union of commit moments of all files in the first and the second cluster, respectively. This is equivalent to a full linkage clustering, which is computationally more expensive than other techniques such as single linkage or k-means [9]. To give a practical estimation of the efficiency, clustering a system containing around 850 versions of 2000 files, using our evolutionary coupling implemented in C++, took just under 5 minutes on a 2.5 GHz Windows PC. After the clustering tree is computed, we use it to visualize a decomposition of the system's evolution by displaying the clusters at the level of detail that the user selects interactively. This technique is presented along with our other visualization tools in the next section.

5 Visualization

Visualization tries to give insight into large and complex evolution data by delegating pattern detection and correlation making to the human visual system. Visualization can also present the results of data analysis in an intuitive, ready-to-use way. Visualization is a main ingredient of our software repository mining framework. We present next a set of experiments that illustrate the applicability of our framework on real-life, industry-size software repositories.
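Since these experiments rely on the evolutionary coupling of Section 4, a compact sketch of that computation may help. It follows the description given there (nearest commit within a neighborhood k, inverse square root of the time distance, bottom-up merging on the union of commit moments) but is not the authors' C++ implementation. It assumes timestamps in seconds and adds 1 to the distance inside the square root so that simultaneous commits do not divide by zero; all function names are hypothetical.

```python
import bisect
import math


def one_sided(s1, s2, k):
    """For each commit time in s1, add 1/sqrt(d + 1), where d is the
    distance to the nearest commit time in s2, provided that d < k."""
    s2 = sorted(s2)
    total = 0.0
    for t in s1:
        pos = bisect.bisect_left(s2, t)
        # The nearest neighbor is either just before or at the insertion point.
        candidates = s2[max(pos - 1, 0):pos + 1]
        if not candidates:
            continue
        d = min(abs(t - u) for u in candidates)
        if d < k:
            total += 1.0 / math.sqrt(d + 1)
    return total


def similarity(s1, s2, k):
    """Symmetric sum of the two one-sided terms."""
    return one_sided(s1, s2, k) + one_sided(s2, s1, k)


def cluster(files, k):
    """Agglomerative clustering of {file name: [commit times]}.

    Returns a binary tree as nested tuples. A merged cluster is compared
    to others using the concatenation of its members' commit times.
    This naive version recomputes all pairs in every round.
    """
    clusters = [(name, list(times)) for name, times in files.items()]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = similarity(clusters[i][1], clusters[j][1], k)
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        (tree_a, times_a), (tree_b, times_b) = clusters[i], clusters[j]
        merged = ((tree_a, tree_b), times_a + times_b)
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]
```

For example, `cluster({"a.c": [100, 500, 900], "a.h": [101, 502, 905], "README": [5000]}, k=3600)` first merges `a.c` and `a.h`, whose commits lie seconds apart, and only then attaches `README`. Cutting the returned tree at a chosen depth gives the clusters for one level of detail.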
To this end we use the evolutionary coupling computation technique proposed in the previous section and the CVSgrab tool [21] as our visualization back-end. The evolutionary coupling algorithm presented in Section 4 is connected to the visualization back-end through the data acquisition mediator described in Section 3 to form our complete evolution analysis framework.

We first outline the functionality of the CVSgrab tool. For more details, we refer to [21]. CVSgrab is an interactive tool that depicts project evolution at file level. A complete project, as present in an SCM repository, is rendered as a set of horizontal strips. Each strip shows a file's evolution in time (Figure 2). The x axis shows the time. The file strips are stacked along the y axis following different layouts, which can be interactively set up in CVSgrab to suit different types of analyses. Color can be used to show several attributes: file type, size, author, metrics computed on the file contents, or other information such as the occurrence of a specific word in the file's commit log or information from an associated bug database. All these mechanisms are presented in the next sections, by illustrating how they can be used to assess several aspects of software evolution.
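The y-axis layouts can be thought of as different sort orders over the file strips. The sketch below shows three such orders (by path, by activity, by creation time), matching the layouts used in the case studies that follow. The `FileHistory` record and the function names are hypothetical, creation time is assumed to be the earliest commit time, and nothing here reflects CVSgrab's actual implementation.

```python
from collections import namedtuple

# A file and the timestamps of its versions (illustrative structure).
FileHistory = namedtuple("FileHistory", "path commit_times")


def creation_time(f):
    """Assumption: a file is created at its earliest recorded commit."""
    return min(f.commit_times)


def layout_alphabetical(files):
    """Sort on full path, which groups files by folder."""
    return sorted(files, key=lambda f: f.path)


def layout_by_activity(files):
    """Most versions first; ties are broken by decreasing creation time."""
    return sorted(files, key=lambda f: (-len(f.commit_times), -creation_time(f)))


def layout_by_creation(files):
    """Youngest files first, oldest files last."""
    return sorted(files, key=lambda f: -creation_time(f))
```

Each function returns the files in top-to-bottom drawing order, so switching layouts only changes the row index assigned to each strip, not the strip's contents.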

Fig. 2 CVSgrab visualization of project evolution. The horizontal axis shows time and the vertical axis shows files (F1 to F4); each file is drawn as a strip of versions (V1 to V4), and color encodes version-based file metrics.

5.1 Case 1: Assessment of Software Metrics Evolution

We analyzed the evolution of ArgoUML, an object-oriented design tool with a 6-year evolution of 4452 files developed by 37 authors. This project is hosted in a CVS repository. Data acquisition took 31 minutes over a T1 Internet connection: 8 minutes for the initial setup (i.e., one-time retrieval of the last version of 59MB) and 23 minutes to retrieve the evolution data to be visualized (20MB). In Figure 3, a 12-snapshot matrix shows ArgoUML's evolution, displayed using CVSgrab. Each column shows the evolution of one metric:

- Column 1 shows the development team evolution. Color shows the ID of the user who committed each file.
- Column 2 shows the file size evolution, measured as changed lines of code (LOC). Hue encodes change type: gray shows the first commit of a file, blue shows a file size increase, red a decrease, and yellow a commit that changes content but not size. Brightness encodes change size: lighter colors show smaller changes, darker colors show larger changes.
- Column 3 shows file type: red for Java sources, green for images, and yellow for HTML files.
- Column 4 shows files that contain a given string in their commit comment: green are files that contain the word "fix", blue are files that contain the word "error".

Each matrix row in Figure 3 uses another layout on the y axis. In row A, files are sorted alphabetically on their full path, and thus show the folder structure. In row B, files are sorted from top to bottom in decreasing order of number of versions, i.e., on file activity. Files that have the same number of versions are further sorted in decreasing order of creation time. In row C, files are sorted in decreasing order of creation time.
Files created in the beginning of the project are at the bottom, while young files are at the top.

Figure 3 shows several interesting aspects of the process and organization of ArgoUML. Cell C3 shows that the project started with a documentation base (green and yellow files) that probably contained the system specification. This was contributed by one user (jrobbins = brown in C1) and remained unchanged for the entire project duration except for a large addition (dark blue in C2) done by another user two years later (dennyd = red in C1). The added code concerned an underspecified issue, as it was extended again two years later (dark blue in C2) by another author (mvw = cyan in C1). The real implementation first appeared 6 months

after the specification was committed (Java source files = red in C3) and was contributed by one author (1sturm = blue in C1). Two years from the project start, another big documentation chunk was added (yellow and green in C3) by one user (jeremybennett = yellow in C1). Although the specification, implementation, and documentation each appear to be the work of one author, it is intriguing that each was committed all at once rather than incrementally, and contained many (around 400) files. We believe that this denotes the work of several people, which was first checked in in bulk by one single person.

Fig. 3 ArgoUML metrics evolution visualization with CVSgrab. Rows: A, sort alphabetically; B, sort by activity; C, sort by creation time. Columns: 1, team evolution; 2, size evolution; 3, file type evolution; 4, search evolution.

For the rest of the project, one user has a major contribution (linus = green in C1), with one exception in the fifth project year (mvw = cyan in C1). The large oval (1) in A1 shows that mvw (cyan in A1) made a large contribution (dark blue, large oval (1) in A2) to the implementation (red in A3). The same pattern can be seen following the large ovals (1) in B1 and B2. A3 shows that the project has a very clean organization. The major color groups correspond to the folders documentation (green at the top), src_new (red in the middle) and www (yellow and green at the bottom). From B3, one can see that most activity during the project was related, as expected, to changes in the implementation files (red at the top), followed by changes in the documentation (yellow in the middle) and in the documentation images (green at the bottom). B2 shows that almost one-third of the files added during the project did not change during all six years (as they are gray). Most such files are documentation (yellow and green in B3).
The largest part of the previously identified system specification also belongs to this group (brown in B1, by correlation with C1 and C3). Another interesting aspect is shown in the small ovals (2) in A1 and A4. In the fourth project year, linus made a significant contribution, not in size (no big size

change pattern seen in A2) but in terms of code cleaning. Many implementation files (red in A3) containing the words "fix" and "error" in their commit comment have been committed by linus. The same pattern can be seen in row B. The large green (i.e., "fix") horizontal pattern visible in column 4 corresponds to an initial checkout of documentation files. It suggests that previous work has been done in that area without being committed. Figure 3 also shows that the project size almost never decreased. One exception, highlighted in C2, shows a size drop for documentation files (yellow in C3) that occurred at the end of the fourth year.

Fig. 4 PostgreSQL metrics evolution visualization with CVSgrab. Rows: A, sort alphabetically; B, sort by activity; C, sort by creation time. Columns: 1, team evolution; 2, size evolution; 3, file type evolution; 4, search evolution.

Figure 4 shows a second example of using our framework to assess software evolution. It shows the evolution of PostgreSQL, an object-relational database management system hosted on a CVS repository, with a history of 10 years, 2829 files, and 27 authors. The data acquisition step took 28 minutes: 7 minutes for the initial setup (i.e., one-time retrieval of the last project version = 56MB) and 21 minutes for retrieving the evolution information to be visualized (29MB). The evolution retrieval time was now smaller than in our first example, even though more data was retrieved. This is explained by the connection overhead. When retrieving evolution data, the connection has to be established for each file. In the second example, there are fewer files than in the first example, which decreased the overall connection latency.

Figure 4 shows a 12-snapshot matrix of the evolution of PostgreSQL, structured similarly to Figure 3. Columns show the development team (1), size evolution (2), file type (3) and string search (4) encoded by colors, just as in example 1, except for file type and string search.
In column 3, C source files are blue, C headers are light green, SGML documentation files are dark green, SQL files are

pink, and test support files are red. In column 4, green shows versions that contain the word "fix" in their associated commit comment, and red shows versions that contain the word "bug". The matrix rows use different sortings for files on the y axis: alphabetical (A), number of revisions (B) and creation time (C).

Using Figures 4 and 3, we can compare the evolution of PostgreSQL and ArgoUML. Cell C3 (Figure 4) shows that the project started with a source code base (blue at bottom) and not with a specification, as for ArgoUML. Even the header files (light green) were not committed until several months later. As for ArgoUML, the initial contribution (about 400 C sources and headers) was done by one person (scrappy = red in C1). This suggests previous development unrecorded in CVS. The project has a few main contributors: momjian (light green), tgl (dark blue), pgsql (magenta), petere (cyan), thomas (yellow-green). The commits of momjian and tgl are interleaved at intervals of around 6-8 months (column 1) and address the most active parts of the system (B1). These parts correspond to the C sources and headers, by correlation via A1 and A3. These parts are also modified by pgsql in the last two project years.

A detailed look at B1 and B2 reveals the commit patterns of momjian and tgl. momjian's commits do not usually change file sizes (yellow in B2) and are done at large intervals. In contrast, tgl's commits are done more often and often change file sizes. The contributions of momjian abruptly interrupt those of tgl, but not conversely. This suggests the real work might be done by tgl, while momjian has more the role of a code standard manager. By zooming in on the file contents, we saw that the changes done by momjian were mainly indentation and copyright text. Finally, petere and thomas mainly contributed to the documentation (by correlating A1 and A3).
As for ArgoUML, PostgreSQL has a clean organization (A3): source, header, documentation, and test files are well separated. Most activity occurs in the implementation (C) files, but also in headers and documentation files. This suggests frequent architectural changes. No large size changes are seen throughout the project, except in the documentation (label 1 in A2).

Finally, column 4 in Figure 4 shows the distribution of the words "fix" and "bug" in commit logs. The green patterns (2) show versions containing the word "fix". By correlating C3, C4 and C1, we saw that these patterns match header files committed by momjian. Hence, we thought they would contain cosmetic changes. Indeed, file-by-file analysis showed that "fix" actually refers to the indentation tool and not to the system code itself. Other occurrences of the word "fix" are evenly spread, mainly over the evolution of C sources (A4). The red patterns (3 in A4 and C4) show versions containing the word "bug". These are test files and appear close to the file creation moment. This, together with the fact that test files appear early in the project, suggests an active test policy. The rest of the occurrences of the word "bug" are evenly distributed, mostly over C sources.

5.2 Case 2: Assessment of Evolutionary Coupling

One question we were interested in was which components of PostgreSQL show similar evolution patterns. To answer this, we computed a cluster hierarchy based on evolutionary coupling, as explained in Section 4. Figure 5 shows

the clustering using the CVSgrab tool. Files in the same cluster are drawn using the so-called plateau cushions, dark at the edges and bright in the middle [21]. In each cluster, files are sorted on the y axis in inverse order of their creation time. Color shows different metrics: word distribution (1), size evolution (2), file author (3) and file type (4), just as in Figure 4. We see five main evolution groups. Column 4 shows three main groups: sources (a,b,c), documentation (d), and test scenario files (e). We can easily see that source files appearing in the beginning of the project evolve similarly (b). These may refer to a subsystem having a high logical coupling, which can be seen as a building block or component. The same holds for the other two source code clusters (a,c). All these clusters share the same developer team (3) and size evolution patterns (2). The early introduced source code cluster (b) has, however, more versions whose commit logs contain the word "bug" (marked in 1). This suggests a problematic implementation in this cluster. Documentation forms a separate cluster (d), which suggests it does not target the system's detailed design, as it doesn't change in sync with the headers. Finally, the large cluster at the bottom of the images (e) contains mainly test scenarios, so it nicely factors out files intended to support the development process, rather than implementing actual functionality. Changing the level of detail, we can further split the above clusters for a finer analysis. This can be useful not only to obtain a logical decomposition, but also to predict future changes with different levels of confidence, similar to [29].

Fig. 5 PostgreSQL evolution clusters visualization with CVSgrab. Main clusters are labeled a to e. Columns: 1, word distribution; 2, size evolution; 3, file author; 4, file type.

5.3 Case 3: Assessment of Source Code Style

A frequent question that appears when assessing a third-party software stack for the first time is: What does the source code of a given project look like? We chose the KDE KOffice project as a case study in this direction. We used the Subversion module of the data acquisition mediator to retrieve the entire KDE KOffice project history (10616 files, 37 authors, 9 years of evolution) and cache it locally. This operation took 30 minutes, which is several times faster than a similar action on a CVS repository, for two reasons. First, Subversion allows retrieving the list of files directly from the repository. There is no need for the initial CVS setup step

where the final versions of all files are retrieved prior to history information. Secondly, Subversion offers a compact, thus network-friendly, output format. After acquisition, we used CVSgrab's filter functions to select the 108 C++ source and header files from the KOfficeui folder for detailed investigation. We split the files into sources and headers using the sort-by-type feature of CVSgrab and then sorted each group in increasing order of creation time. Finally, we parsed the files using a custom lightweight parser, computed the number of code and comment lines, and visualized these using color. Figure 6 top shows the size evolution using a rainbow color map. Most implementation files have over 90 LOC (red). There are two files with fewer than 10 LOC (blue). The source files seem to stay quite constant in size, while their number steadily increases. This signals the addition of new features as opposed to maintenance or refactoring. In contrast, most headers are up to 40 LOC. One larger header was simplified (B). At one point (A), an interface refactoring took place: a lot of code was systematically removed from the headers across the entire library, yet the implementation stayed constant (there is no drop in the size of the sources). Figure 6 bottom uses a rainbow color map to show the evolution of comments. Both headers and sources have a similar amount of comments, which denotes good coding style. The amount of comments increases in a few sources at one point (C), while the size stays constant, which suggests a code review and documentation addition.

5.4 Case 4: Assessment of Change Propagation during Debugging

An important question in maintenance is whether debugging-induced changes propagate system-wide or are localized to subsystems. The latter situation is desirable, as it allows modular, simpler, and ultimately cheaper maintenance.
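The localization question above can also be phrased numerically. The following is a minimal sketch, not the visual method used in this paper: it assumes that the paths of files touched by bug-related commits are already available as a plain list, and it reports the folder that receives the largest share of those changes. The function names and the file names in the example are hypothetical; only the folder name is taken from this case study.

```python
from collections import Counter
from pathlib import PurePosixPath


def folder_shares(changed_files):
    """Map each folder to its fraction of the bug-related file changes."""
    counts = Counter(str(PurePosixPath(path).parent) for path in changed_files)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {folder: n / total for folder, n in counts.items()}


def localization(changed_files):
    """Return (folder, share) for the folder with the most changes.

    A share close to 1.0 means the changes stay within one folder;
    a small share means they are spread over many folders.
    """
    shares = folder_shares(changed_files)
    if not shares:
        return None, 0.0
    folder = max(shares, key=shares.get)
    return folder, shares[folder]


# Hypothetical input: one entry per file touched by a bug-related commit.
changes = [
    "components/places/content/tree.js",
    "components/places/content/places.js",
    "components/places/content/tree.js",
    "components/search/search.js",
]

top_folder, share = localization(changes)
print(top_folder, share)
```

Such a number only summarizes what the visualization shows in more detail, since it ignores when the changes happened and which files evolve together.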
We studied this question on the Mozilla Firefox code (659 files, 120 authors, 5 years of evolution). We first acquired the Mozilla evolution data from the official CVS site, which took 30 minutes including the initial setup step. Next, we implemented a custom data acquisition module to extract related bug data from the project's Bugzilla database. We merged the history and bug acquisition modules in the mediator's data pipeline and finally loaded all the information into the visualization. We used small red square markers, as opposed to colors, to show the time and files where debugging activities related to the so-called blocker bugs occur (Figure 7 top). Each marker is surrounded by a transparent disk of constant radius. This allows bugs located close to each other in time and project space to blend together, which reveals debugging-intensive areas. Figure 7 top shows a red area of high density, indicating intensive debugging activity in the recent past. Looking at folder names, we saw that this area almost perfectly matches the /components/places/content folder (colored in green in Figure 7 middle). Consequently, we wanted to know which other files evolve similarly to these intensively debugged files. For this, we clustered the project based on evolution similarity (Section 4). Figure 7 bottom shows a zoomed-in view of the cluster containing files from the /components/places/content folder. The resulting

Fig. 6 Source code evolution in KDE KOffice (file groups: .cpp, .h). Increasing number of constant size sources suggests new feature addition as opposed to maintenance. A complex header file is simplified (B), size of headers drops at one moment due to an interface refactoring (A) (top). Number of comment lines is similar in implementation and header files, some implementation files become more commented possibly after a code review (C) (bottom)

Fig. 7 Assessment of change propagation during debugging in Firefox. Red spots show debugging activities addressing blocker bugs (top). Green marks the folder containing the most intense debugging activity (middle). These files form a separate evolution cluster (bottom), so changes in the debugging zone stay localized.

cluster clearly contains only files from the investigated folder (green). Hence, debugging activities nicely match the folder structure and do not propagate to other parts of the system.

5.5 Case 5: Assessment of Architecture Stability

Another question in software evolution is how the code structure changes during a project's lifetime. In this case study, we look at how stable function interfaces are. We used the same process as in case 3 to acquire the history of KDE KOffice. Next, we focused on the Libs folder, which contains C/C++ source files and their interface headers. We then integrated a custom C/C++ parser as a filter into our data acquisition mediator, and used it to extract the names of the functions implemented in the Libs folder. This operation took one hour. The resulting data was visualized with CVSgrab. We focused next on those interface functions which are implemented in several source files. These are typically overridden methods of some C++ base class. For each such function, we visualize when the files that implement it change, by coloring those file versions with the function's color ID (Figure 8). When several functions are selected for inspection and they all change in the same version, we color that version red.

Fig. 8 Assessment of interface stability in KDE KOffice. Files implementing the name method (blue) do not change together, i.e., the associated interface is stable (top). Files implementing the draw method (green) change together, so the associated interface is less stable (bottom)

For some interfaces, e.g. the draw method, we see that the

files that implement it have many moments when they change together, as shown in Figure 8 bottom by the vertical stripes. We next used the details-on-demand mechanism of CVSgrab to investigate the commit messages of those files that changed together. In some cases, the changes addressed textual issues such as indentation (4 in Figure 8 bottom). These situations are not interesting for analysis. However, there were also cases in which the change referred to an issue of the interface, which propagated to all the files that implemented it (1, 2, and 3 in Figure 8 bottom). Such patterns indicate an unstable interface. The most favorable case is when changes in the files implementing the interface are uncorrelated (Figure 8 top).

6 Conclusions

We presented a new framework for visual data mining and analysis of software repositories. Our goal is twofold. First, we use several novel visualization techniques to analyze the large amounts of information from software repositories. Our techniques are simple: 2D Cartesian layouts, color showing attributes, sorting on any attribute, clustering based on evolutionary similarity, and bugs displayed as blended areas. However, their power resides precisely in this simplicity. Combined with their genericity, i.e. the fact that any technique can be used on any attribute, and driven by interactive selection and zooming mechanisms, this yields a versatile and effective visualization environment. Adding the generic dataflow-driven filter mechanism of the data acquisition mediator yields a complete framework for evolution data acquisition, mining, and visualization. Our second goal was to develop an experimental foundation, both in software tools and methodology, for research on software evolution.
Our methodology is illustrated by means of five different case studies, covering analysis of the developer network, software interfaces, code style, system modularity, and debugging activities, over four industry-size Open Source projects: ArgoUML, PostgreSQL, Mozilla Firefox, and KDE KOffice. We stress that the examples presented here are simplified scenarios, due to space constraints. We have used our framework together with actual developers of commercial software, as well as with developers of open source libraries, in several more involved activities targeting more complex questions. The feedback collected revealed a methodology very similar to the one presented in the scenarios here. Moreover, our framework competed successfully in several software repository mining and visualization challenges [22,23]. A strong point revealed in all these situations was the simplicity of retrieving repository data, supported by the flexible design of the data acquisition mediator. We also propose a new technique for computing evolutionary coupling. This technique helps users both to perform a logical decomposition of the system and to predict future changes in the system from the perspective of selected files. This technique is completely independent of the file contents, which is a strong advantage in terms of robustness, simplicity, and genericity. We aim next to improve the similarity measure of the evolutionary coupling mechanism by using additional attributes, e.g., file type. The challenge here is to find the similarity description that best matches a given user requirement. Additionally, we would like to extend

the framework with other generic visualization and code analysis mechanisms, to support more complex and also more specific questions.

References

1. T. Ball, J. Kim, A. Porter, and H. Siy. If your version control system could talk... In Proc. Workshop on Process Modeling and Empirical Studies of Software Engineering.
2. K. Bennett, E. Burd, C. Kemerer, L. M. M., M. Lee, R. Madachy, C. Mair, D. Sjoberg, and S. Slaughter. Empirical studies of evolving systems. Empirical Software Engineering, 4(4).
3. J. M. Bieman, A. Andrews, and H. Yang. Understanding change-proneness in OO software through visualization. In Proc. IWPC'03. IEEE Press.
4. Bonsai, www
5. M. Burch, S. Diehl, and P. Weißgerber. Visual data mining in software archives. In SoftVis'05: Proceedings of the 2005 ACM symposium on Software visualization, pages 37-46, New York, NY, USA. ACM Press.
6. C. Collberg, S. Kobourov, J. Nagra, J. Pitts, and K. Wampler. A system for graph-based visualization of the evolution of software. In Proc. ACM SoftVis'03. ACM Press.
7. CVS, www
8. S. G. Eick, J. Steffen, and E. Sumner. Seesoft: a tool for visualizing line oriented software statistics. IEEE Trans. Soft. Eng., 18(11).
9. E. Everitt, S. Landau, and M. Leese. Cluster Analysis. Arnold Publishers, Inc.
10. M. Fischer, M. Pinzger, and H. Gall. Populating a release history database from version control and bug tracking systems. In Proc. ICSM. IEEE Press.
11. J. Froehlich and P. Dourish. Unifying artifacts and activities in a visual tool for distributed software development teams. In Proc. ICSE'04. IEEE Press.
12. H. Gall, M. Jazayeri, and J. Krajewski. CVS release history data for detecting logical couplings. In Proc. IWPSE. IEEE Press.
13. D. German, A. Hindle, and N. Jordan. Visualizing the evolution of software using softchange. In Proc. 16th Intl. Conf. on Software Engineering and Knowledge Engineering.
14. D. German and A. Mockus. Automating the measurement of open source projects. In Proc. Workshop on Open Source Software Engineering (OOSE).
15. R. M. Greenwood, B. Warboys, R. Harrison, and P. Henderson. An empirical study of the evolution of a software system. In Proc. 13th Conf. on Automated Software Engineering (ASE).
16. M. Lanza. The evolution matrix: recovering software evolution using software visualization techniques. In Proc. IWPSE'01. ACM Press.
17. L. Lopez-Fernandez, G. Robles, and J. Gonzalez-Barahona. Applying social network analysis to the information in CVS repositories. In Proc. Intl. Workshop on Mining Software Repositories (MSR). IEEE Press.
18. NetBeans.javacvs, www
19. M. Pinzger, H. Gall, M. Fischer, and M. Lanza. Visualizing multiple evolution metrics. In SoftVis'05: Proceedings of the 2005 ACM symposium on Software visualization, pages 67-75, New York, NY, USA. ACM Press.
20. SVN, www.
21. L. Voinea and A. Telea. CVSgrab: Mining the history of large software projects. In Proc. EuroVis'06. IEEE Press.
22. L. Voinea and A. Telea. How do changes in buggy Mozilla files propagate? In SoftVis'06: Proceedings of the 2006 ACM symposium on Software visualization, New York, NY, USA. ACM Press.
23. L. Voinea and A. Telea. Mining software repositories with CVSgrab. In MSR'06: Proceedings of the 2006 international workshop on Mining software repositories, New York, NY, USA. ACM Press.
24. L. Voinea, A. Telea, and J. van Wijk. Visualization of code evolution. In Proc. ACM SoftVis'05. ACM Press.
25. J. Wu, C. Spitzer, A. Hassan, and R. Holt. Evolution spectrographs: Visualizing punctuated change in software evolution. In Proc. of IWPSE'04. IEEE.
26. X. Wu. Visualization of version control information. Master's thesis, University of Victoria.
27. A. Ying, G. Murphy, R. Ng, and M. Chu-Carroll. Predicting source code changes by mining revision history. IEEE Trans. on Soft. Eng., 30(9).
28. T. Zimmermann and Weisgerber. Preprocessing CVS data for fine-grained analysis. In Proc. Intl. Workshop on Mining Software Repositories (MSR). IEEE Press.
29. T. Zimmermann, P. Weisgerber, S. Diehl, and A. Zeller. Mining version histories to guide software changes. In Proc. ICSE'04. IEEE Press, 2004.


More information

Predicting Faults from Cached History

Predicting Faults from Cached History Sunghun Kim 1 hunkim@csail.mit.edu 1 Massachusetts Institute of Technology, USA Predicting Faults from Cached History Thomas Zimmermann 2 tz @acm.org 2 Saarland University, Saarbrücken, Germany E. James

More information

NaCIN An Eclipse Plug-In for Program Navigation-based Concern Inference

NaCIN An Eclipse Plug-In for Program Navigation-based Concern Inference NaCIN An Eclipse Plug-In for Program Navigation-based Concern Inference Imran Majid and Martin P. Robillard School of Computer Science McGill University Montreal, QC, Canada {imajid, martin} @cs.mcgill.ca

More information

SolidSDD Duplicate Code Detector User Manual

SolidSDD Duplicate Code Detector User Manual SolidSDD Duplicate Code Detector User Manual For SolidSDD v1.4 March 2010 P a g e 3 Contents 1 Introduction... 5 1.1 Supported configurations... 6 1.2 Installation... 6 2 Main functions... 7 2.1 Basic

More information

Bug Triaging: Profile Oriented Developer Recommendation

Bug Triaging: Profile Oriented Developer Recommendation Bug Triaging: Profile Oriented Developer Recommendation Anjali Sandeep Kumar Singh Department of Computer Science and Engineering, Jaypee Institute of Information Technology Abstract Software bugs are

More information

Managing Open Bug Repositories through Bug Report Prioritization Using SVMs

Managing Open Bug Repositories through Bug Report Prioritization Using SVMs Managing Open Bug Repositories through Bug Report Prioritization Using SVMs Jaweria Kanwal Quaid-i-Azam University, Islamabad kjaweria09@yahoo.com Onaiza Maqbool Quaid-i-Azam University, Islamabad onaiza@qau.edu.pk

More information

Improved Database Development using SQL Compare

Improved Database Development using SQL Compare Improved Database Development using SQL Compare By David Atkinson and Brian Harris, Red Gate Software. October 2007 Introduction This white paper surveys several different methodologies of database development,

More information

Quantifying and Assessing the Merge of Cloned Web-Based System: An Exploratory Study

Quantifying and Assessing the Merge of Cloned Web-Based System: An Exploratory Study Quantifying and Assessing the Merge of Cloned Web-Based System: An Exploratory Study Jadson Santos Department of Informatics and Applied Mathematics Federal University of Rio Grande do Norte, UFRN Natal,

More information

File Size Distribution on UNIX Systems Then and Now

File Size Distribution on UNIX Systems Then and Now File Size Distribution on UNIX Systems Then and Now Andrew S. Tanenbaum, Jorrit N. Herder*, Herbert Bos Dept. of Computer Science Vrije Universiteit Amsterdam, The Netherlands {ast@cs.vu.nl, jnherder@cs.vu.nl,

More information

Mapping Bug Reports to Relevant Files and Automated Bug Assigning to the Developer Alphy Jose*, Aby Abahai T ABSTRACT I.

Mapping Bug Reports to Relevant Files and Automated Bug Assigning to the Developer Alphy Jose*, Aby Abahai T ABSTRACT I. International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 1 ISSN : 2456-3307 Mapping Bug Reports to Relevant Files and Automated

More information

Chapter 1 GETTING STARTED. SYS-ED/ Computer Education Techniques, Inc.

Chapter 1 GETTING STARTED. SYS-ED/ Computer Education Techniques, Inc. Chapter 1 GETTING STARTED SYS-ED/ Computer Education Techniques, Inc. Objectives You will learn: WSAD. J2EE business topologies. Workbench. Project. Workbench components. Java development tools. Java projects

More information

Week 7 Picturing Network. Vahe and Bethany

Week 7 Picturing Network. Vahe and Bethany Week 7 Picturing Network Vahe and Bethany Freeman (2005) - Graphic Techniques for Exploring Social Network Data The two main goals of analyzing social network data are identification of cohesive groups

More information

Software Metrics based on Coding Standards Violations

Software Metrics based on Coding Standards Violations Software Metrics based on Coding Standards Violations Yasunari Takai, Takashi Kobayashi and Kiyoshi Agusa Graduate School of Information Science, Nagoya University Aichi, 464-8601, Japan takai@agusa.i.is.nagoya-u.ac.jp,

More information

Are Refactorings Less Error-prone Than Other Changes?

Are Refactorings Less Error-prone Than Other Changes? Are Refactorings Less Error-prone Than Other Changes? Peter Weißgerber University of Trier Computer Science Department 54286 Trier, Germany weissger@uni-trier.de Stephan Diehl University of Trier Computer

More information

EmpAnADa Project. Christian Lange. June 4 th, Eindhoven University of Technology, The Netherlands.

EmpAnADa Project. Christian Lange. June 4 th, Eindhoven University of Technology, The Netherlands. EmpAnADa Project C.F.J.Lange@tue.nl June 4 th, 2004 Eindhoven University of Technology, The Netherlands Outline EmpAnADa introduction Part I Completeness and consistency in detail Part II Background UML

More information

Manage quality processes with Bugzilla

Manage quality processes with Bugzilla Manage quality processes with Bugzilla Birth Certificate of a Bug: Bugzilla in a Nutshell An open-source bugtracker and testing tool initially developed by Mozilla. Initially released by Netscape in 1998.

More information

A Detailed Examination of the Correlation Between Imports and Failure-Proneness of Software Components

A Detailed Examination of the Correlation Between Imports and Failure-Proneness of Software Components A Detailed Examination of the Correlation Between Imports and Failure-Proneness of Software Components Ekwa Duala-Ekoko and Martin P. Robillard School of Computer Science, McGill University Montréal, Québec,

More information

UML-Based Conceptual Modeling of Pattern-Bases

UML-Based Conceptual Modeling of Pattern-Bases UML-Based Conceptual Modeling of Pattern-Bases Stefano Rizzi DEIS - University of Bologna Viale Risorgimento, 2 40136 Bologna - Italy srizzi@deis.unibo.it Abstract. The concept of pattern, meant as an

More information

Impact Analysis by Mining Software and Change Request Repositories

Impact Analysis by Mining Software and Change Request Repositories Impact Analysis by Mining Software and Change Request Repositories Gerardo Canfora, Luigi Cerulo RCOST Research Centre on Software Technology Department of Engineering University of Sannio Viale Traiano

More information

What can we learn from version control systems?

What can we learn from version control systems? 2IS55 Software Evolution What can we learn from version control systems? Alexander Serebrenik Assignment 2: Feedback # mean std. dev A1 13 3.23 0.725 A2 15 3.2 1.08 Likes: Coding as opposed to report writing

More information

What can we learn from version control systems?

What can we learn from version control systems? 2IS55 Software Evolution What can we learn from version control systems? Alexander Serebrenik Assignments Assignment 4: Deadline: April 6 Questions? Assignment 5: Published on Peach Deadline: April 20

More information

Programming the Semantic Web

Programming the Semantic Web Programming the Semantic Web Steffen Staab, Stefan Scheglmann, Martin Leinberger, Thomas Gottron Institute for Web Science and Technologies, University of Koblenz-Landau, Germany Abstract. The Semantic

More information

Analysis of Sorting as a Streaming Application

Analysis of Sorting as a Streaming Application 1 of 10 Analysis of Sorting as a Streaming Application Greg Galloway, ggalloway@wustl.edu (A class project report written under the guidance of Prof. Raj Jain) Download Abstract Expressing concurrency

More information

A Web Application to Visualize Trends in Diabetes across the United States

A Web Application to Visualize Trends in Diabetes across the United States A Web Application to Visualize Trends in Diabetes across the United States Final Project Report Team: New Bee Team Members: Samyuktha Sridharan, Xuanyi Qi, Hanshu Lin Introduction This project develops

More information

Commit Guru: Analytics and Risk Prediction of Software Commits

Commit Guru: Analytics and Risk Prediction of Software Commits Commit Guru: Analytics and Risk Prediction of Software Commits Christoffer Rosen, Ben Grawi Department of Software Engineering Rochester Institute of Technology Rochester, NY, USA {cbr4830, bjg1568}@rit.edu

More information

Software Testing and Maintenance 1. Introduction Product & Version Space Interplay of Product and Version Space Intensional Versioning Conclusion

Software Testing and Maintenance 1. Introduction Product & Version Space Interplay of Product and Version Space Intensional Versioning Conclusion Today s Agenda Quiz 3 HW 4 Posted Version Control Software Testing and Maintenance 1 Outline Introduction Product & Version Space Interplay of Product and Version Space Intensional Versioning Conclusion

More information

BugMaps-Granger: A Tool for Causality Analysis between Source Code Metrics and Bugs

BugMaps-Granger: A Tool for Causality Analysis between Source Code Metrics and Bugs BugMaps-Granger: A Tool for Causality Analysis between Source Code Metrics and Bugs Cesar Couto, Pedro Pires, Marco Tulio Valente, Roberto Bigonha, Andre Hora, Nicolas Anquetil To cite this version: Cesar

More information

Identifying Changed Source Code Lines from Version Repositories

Identifying Changed Source Code Lines from Version Repositories Identifying Changed Source Code Lines from Version Repositories Gerardo Canfora, Luigi Cerulo, Massimiliano Di Penta RCOST Research Centre on Software Technology Department of Engineering - University

More information

Rubicon: Scalable Bounded Verification of Web Applications

Rubicon: Scalable Bounded Verification of Web Applications Joseph P. Near Research Statement My research focuses on developing domain-specific static analyses to improve software security and reliability. In contrast to existing approaches, my techniques leverage

More information

TopicViewer: Evaluating Remodularizations Using Semantic Clustering

TopicViewer: Evaluating Remodularizations Using Semantic Clustering TopicViewer: Evaluating Remodularizations Using Semantic Clustering Gustavo Jansen de S. Santos 1, Katyusco de F. Santos 2, Marco Tulio Valente 1, Dalton D. S. Guerrero 3, Nicolas Anquetil 4 1 Federal

More information

Mining a Change-Based Software Repository

Mining a Change-Based Software Repository Mining a Change-Based Software Repository Romain Robbes Faculty of Informatics University of Lugano, Switzerland Abstract Although state-of-the-art software repositories based on versioning system information

More information

IBM Best Practices Working With Multiple CCM Applications Draft

IBM Best Practices Working With Multiple CCM Applications Draft Best Practices Working With Multiple CCM Applications. This document collects best practices to work with Multiple CCM applications in large size enterprise deployment topologies. Please see Best Practices

More information

Application of Relation Analysis to a Small Java Software

Application of Relation Analysis to a Small Java Software Application of Relation Analysis to a Small Java Software Minna Hillebrand Jonne Itkonen Vesa Lappalainen Department of Mathematical Information Technology University of Jyväskylä Finland E-mail: {mmhilleb

More information

A Framework for Source Code metrics

A Framework for Source Code metrics A Framework for Source Code metrics Neli Maneva, Nikolay Grozev, Delyan Lilov Abstract: The paper presents our approach to the systematic and tool-supported source code measurement for quality analysis.

More information

Software Evolution: An Empirical Study of Mozilla Firefox

Software Evolution: An Empirical Study of Mozilla Firefox Software Evolution: An Empirical Study of Mozilla Firefox Anita Ganpati Dr. Arvind Kalia Dr. Hardeep Singh Computer Science Dept. Computer Science Dept. Computer Sci. & Engg. Dept. Himachal Pradesh University,

More information

A Study on Inappropriately Partitioned Commits How Much and What Kinds of IP Commits in Java Projects?

A Study on Inappropriately Partitioned Commits How Much and What Kinds of IP Commits in Java Projects? How Much and What Kinds of IP Commits in Java Projects? Ryo Arima r-arima@ist.osaka-u.ac.jp Yoshiki Higo higo@ist.osaka-u.ac.jp Shinji Kusumoto kusumoto@ist.osaka-u.ac.jp ABSTRACT When we use code repositories,

More information

Bug Inducing Analysis to Prevent Fault Prone Bug Fixes

Bug Inducing Analysis to Prevent Fault Prone Bug Fixes Bug Inducing Analysis to Prevent Fault Prone Bug Fixes Haoyu Yang, Chen Wang, Qingkai Shi, Yang Feng, Zhenyu Chen State Key Laboratory for ovel Software Technology, anjing University, anjing, China Corresponding

More information

Understanding Semantic Impact of Source Code Changes: an Empirical Study

Understanding Semantic Impact of Source Code Changes: an Empirical Study Understanding Semantic Impact of Source Code Changes: an Empirical Study Danhua Shao, Sarfraz Khurshid, and Dewayne E. Perry Electrical and Computer Engineering, The University of Texas at Austin {dshao,

More information

Appendix to The Health of Software Engineering Research

Appendix to The Health of Software Engineering Research Appendix to The Health of Software Engineering Research David Lo School of Information Systems Singapore Management University Singapore davidlo@smu.edu.sg Nachiappan Nagappan and Thomas Zimmermann Research

More information

Portfolios Creating and Editing Portfolios... 38

Portfolios Creating and Editing Portfolios... 38 Portfolio Management User Guide 16 R1 March 2017 Contents Preface: Using Online Help... 25 Primavera Portfolio Management Overview... 27 Portfolio Management Software for Technology Leaders... 27 Solution

More information

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures*

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Tharso Ferreira 1, Antonio Espinosa 1, Juan Carlos Moure 2 and Porfidio Hernández 2 Computer Architecture and Operating

More information

Scalar Visualization

Scalar Visualization Scalar Visualization 5-1 Motivation Visualizing scalar data is frequently encountered in science, engineering, and medicine, but also in daily life. Recalling from earlier, scalar datasets, or scalar fields,

More information

Ranking Clustered Data with Pairwise Comparisons

Ranking Clustered Data with Pairwise Comparisons Ranking Clustered Data with Pairwise Comparisons Alisa Maas ajmaas@cs.wisc.edu 1. INTRODUCTION 1.1 Background Machine learning often relies heavily on being able to rank the relative fitness of instances

More information

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Alejandro Bellogín 1,2, Thaer Samar 1, Arjen P. de Vries 1, and Alan Said 1 1 Centrum Wiskunde

More information