Empirical Study on Impact of Developer Collaboration on Source Code

Akshay Chopra, University of Waterloo, Waterloo, Ontario
Parul Verma, University of Waterloo, Waterloo, Ontario
Sahil Puri, University of Waterloo, Waterloo, Ontario

ABSTRACT

Software development is a collaborative effort in which developers work with each other to create or maintain applications and other software components. Because multiple developers work together to achieve the project goals, this collaboration may affect the quality of the final software, which can be measured in terms of the number of bugs or defects. In this paper, we analyze the effect of developer collaboration on the bug proneness of software by studying 50 open source Java projects on GitHub with considerable project history. We treat each Java file as an individual class and analyze its history to determine the amount of collaboration among developers. We then link this collaboration to the defects traced in those files to empirically assess the effect of collaboration on software quality. We also examine whether the amount of developer collaboration and defect proneness are influenced by project characteristics such as source lines of code (SLOC) and project age. Our findings can be summarized as follows: (a) in a large percentage of projects, the major chunk of the code has been maintained by three or fewer developers; (b) more collaboration on a single file leads to more bugs being logged for that file; (c) as the number of developers and the project age increase while the lines of code decrease, more maintenance work is done on the project instead of new features being added.
David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, 2018

KEYWORDS

Developer collaboration, bugs, defect proneness, software quality, human factors in software engineering

1 INTRODUCTION

Source code is the final product of the efforts of a single developer or of several developers who collaborate to build a particular product. Because most projects are large in scale, they are beyond what a single individual can develop. To manage this complexity, development effort is often split across teams of individuals, each responsible for one or more (less complex) concerns of the overall effort. The evolution of version control systems such as GitHub and SVN, as well as issue tracking systems, has made it possible to capture traces of these collaborative activities among developers. Both open and closed source projects involve collaboration among developers working in different parts of the world, and hence their management is a challenging task. Moreover, changes in the workforce of teams and companies also affect the level of collaboration involved in developing a particular feature. Collaboration among developers is further influenced by the organizational structure and by the architecture of the software being developed, both of which may change as the project evolves. Thus, software development requires coordination and communication among various developers and teams. The dynamic structure of developer collaboration can be measured by mining version control systems and analyzing the activity of the developers. The extent of collaboration among developers in a project may also exhibit a relationship with the quality of the software product. In software engineering, software quality is most often linked with the defects encountered in the system, which are tracked by version control systems and bug tracking systems.
In the absence of bug tracking systems, version control systems have become a common place to find where bugs have occurred in the past, or even to predict where bugs might occur in the future. In 1995, Brooks [1] showed that, at the organizational level, teams with low interdependencies tend to reduce their defect density and hence improve the quality of the software. However, Caglayan et al. [2] showed that the network structure at the source code level is different from the organizational structure. Cataldo et al. studied the effects of team structure on project quality; one of their findings was that team structures differ from the team collaboration structures observed at the source code level. Hence, the influence of the source code network structure on defects may be different from the influence of the organizational structure.

In this paper, we analyze the impact of collaboration among developers on the quality of software modules using 50 highly rated open source Java projects available on GitHub. We assume that, in the absence of a bug tracking system, the quality of a project can be determined by analyzing the commit messages of its source code repository using bug heuristics such as the keywords bug, error, fix, etc. We analyze the effect of developer collaboration on software quality at the change level. We first go through the git history of all the files in a project to find the extent of developer collaboration, and then check the commit messages to see how many defects were reported in each file. To mitigate threats arising from the data, we report relative bug metrics instead of absolute bug metrics, which gives a better understanding of the consequences of developer collaboration.

The major contributions of this research paper are:
(1) We empirically study the effects of developer collaboration on software quality for 50 large scale open source projects.
(2) We relate the extent of developer collaboration to various project metrics such as project age, SLOC, etc.

2 RESEARCH QUESTIONS

2.1 Research Question 1

What is the density of developer collaboration in a single project, i.e., how many files per project have been worked on by how many developers?

Motivation: Developers work together during software development and maintenance to resolve issues and implement features. Large open source Java projects have many classes implemented to achieve the required functionality. According to Java design practices, each class should be placed in a separate file. In large projects, many developers collaborate to implement a piece of functionality, and hence a class may be the result of a single developer's work or of a collaboration among developers. Through this research question, we identify the percentage contribution of each developer in the project.

2.2 Research Question 2

Do concurrent updates from multiple developers result in more bugs than in classes that are maintained by fewer developers?

Motivation: The structure of the developers' collaboration activity may affect the quality of the final product in the form of a higher number of defects. Since developer collaboration is common in large software projects, it is worthwhile to understand the effect of collaboration on defect proneness. Commit history and commit messages provide a good indicator for identifying bug fix commits using the bug heuristics found in the software engineering literature, which look for words such as bug, fix, error, etc. Depending on the files changed during the bug fix, we see whether those files were changed by multiple developers or by a single developer.

2.3 Research Question 3

Is there any notable correlation between project characteristics and developer collaboration?

Motivation: Various characteristics of a project may have a direct impact on developer collaboration, and we want to determine whether any correlation exists between them. The characteristics that we evaluate include the age of the project, source lines of code, etc.

3 DATASET DESCRIPTION

As part of the research, we collected data for 50 projects from GitHub. The projects were first sorted by rating, and only the projects that have more than 80% Java files and more than 2,000 commits were chosen for further analysis. The project files were collected using the GitHub REST API. The data collection process included 2 steps:

(1) Using python git, we collected the following features for each project and stored them as a JSON object for further processing: project name, project URL, project SLOC and source code files. Figure 1 is a snapshot of the data gathered after step 1.

Figure 1: Features extracted after Step 1 in Data Collection Process

(2) The data stored as a JSON object after step 1 was further processed to gather more detailed information: the number of unique developers in a project, from the commit history; whether a file is a buggy file, based on the keywords bug, fix, issue, close and error in commit messages; the distribution of source files by the number of developers who worked on them; the project start date, based on the repository creation date; and the total bugs, total SLOC, mean bugs and mean SLOC associated with each developer group.

The data extracted for each project as part of step 2 was also stored as a JSON object. Figure 2 shows a snapshot of the JSON object created in step 2.
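The per-group aggregation in step 2 can be illustrated with a short sketch. This is not the authors' tool: the `FileRecord` type, its field names, and the sample values are hypothetical, and the sketch assumes that each file's set of authors, its SLOC, and its number of bug fix commits have already been extracted from the repository history.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FileRecord:
    """Hypothetical per-file record; the field names are assumptions."""
    path: str
    authors: frozenset  # unique developers who committed to the file
    sloc: int           # source lines of code of the file
    bugs: int           # commits touching the file that were flagged as bug fixes


def group_by_developer_count(records):
    """Aggregate files into developer groups keyed by number of unique authors."""
    groups = defaultdict(lambda: {"files": 0, "sloc": 0, "bugs": 0})
    for rec in records:
        group = groups[len(rec.authors)]
        group["files"] += 1
        group["sloc"] += rec.sloc
        group["bugs"] += rec.bugs
    # Mean bugs per file within each developer group.
    for group in groups.values():
        group["mean_bugs"] = group["bugs"] / group["files"]
    return dict(groups)


# Illustrative data only, not taken from the studied projects.
records = [
    FileRecord("A.java", frozenset({"alice"}), 120, 1),
    FileRecord("B.java", frozenset({"alice"}), 80, 0),
    FileRecord("C.java", frozenset({"alice", "bob"}), 300, 4),
]
summary = group_by_developer_count(records)
unique_developers = set().union(*(rec.authors for rec in records))
```

Here `summary[1]` describes the files touched by exactly one developer and `summary[2]` the files touched by exactly two, which mirrors the "developer group" notion used in the rest of the paper.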

Figure 2: Features extracted after Step 2 in Data Collection Process

distributiondict is a dictionary that stores information about the number of unique developers and the source code files they worked on, e.g. 1736 of 1969 files were written by a single developer, 202 of 1969 files were written by a group of two developers, and so on.

4 DATA CHARACTERISTICS

This section explains the characteristics of the data that we collected for the analysis. The characteristics are:

Project vs SLOC
This section shows the projects that we collected for analysis and their source lines of code. SLOC across projects varied from 40,000 to 1,200,000 lines of code, with an average of 200,000 lines. Figure 3 plots the projects against their SLOC.

Project vs Buggy File Ratio
This section gives a picture of the buggy files in each project. The share of buggy files in a project varied from 10% to 100% of the project's files. Files are classified as buggy based on the keywords bug, fix, error, issue and close in the commit messages of the files in a project. The Bug File Ratio is defined as:

Bug File Ratio = No. of bug files in a project / Total number of files in a project

Figure 4 shows the distribution of the bug file ratio across projects. Figure 5 shows the distribution of the number of bugs across projects.

Project vs Unique Developers
This section shows the distribution of the number of unique developers that have worked on individual projects in the dataset. The number of unique developers per project ranged from 2 to 159, with an average of 45. Figure 6 shows the distribution of the number of unique developers.

Project vs Age
This section shows the age distribution of all the projects included in the analysis. The project age is calculated as the difference between the current time and the time when the project repository was created. Figure 7 shows the age distribution across all projects.

Figure 8: Project vs Age (in days)

As the age of a project increases, more developers start working on that project, as shown in figure 8.

Figure 9: Number of unique authors in a project increases with project age

5 ANALYSIS

This section discusses the insights gathered from the analysis of the 50 projects described in the previous section. We performed exploratory analysis on the data to discover trends in developer collaboration and to link them with project characteristics such as lines of code, project age and bug proneness of the code.

(1) SLOC Distribution for different projects
In this analysis, we studied the amount of collaboration in the 50 projects and attempted to find what percent of a project

has been edited by how many developers. To perform this analysis, we used the following metric: SLOC distribution, the ratio of the project maintained by a specific developer group (a developer group refers to the Java classes that are maintained by 'n' developers, where 'n' can vary from 1 to the total number of unique authors for the project). In figure 10, the X-axis represents the different projects that were used for the analysis. The Y-axis shows the developer collaboration metric as explained above. The legend gives the 'n' value for the developer collaboration; for example, Series1 denotes the developer collaboration for 1 developer, and so on. As can be seen from the graph, the major chunk of the SLOC distribution lies within n <= 3. So, for a large percentage of projects, the major chunk has been maintained by 3 developers or fewer.

Figure 3: Project vs Source lines of code (SLOC)
Figure 4: Projects vs Bug File Ratio
Figure 5: Project vs Total bugs

(2) Author Distribution for SLOC Ratio
This analysis supports the previous one by calculating the SLOC worked on by different author groups. Here, the X-axis shows the number of developers working on a Java class. The Y-axis shows the SLOC Ratio for the specific developer group. Each series represents a different project used in this analysis. As can be observed from the trend lines added for the three projects, the SLOC ratio for a developer group decreases as the number of developers in the group increases. This supports the previous analysis that the major chunk

of the project's code is maintained by three or fewer developers working on a Java class.

Figure 6: Project vs Total Bugs in a project
Figure 7: Projects vs Unique authors in a project
Figure 10: SLOC Distribution = Files worked upon by unique no. of developers / Total files

(3) Developer Collaboration vs Bugs per SLOC
In this analysis, we calculated the bugs per source line of code for the Java classes worked on by varying numbers of developers. In figure 12, the X-axis represents the number of developers working on a Java class. For example, 1 represents the Java classes that were created and maintained by a single developer, and 3 stands for the Java classes that were created and maintained by three developers. The Y-axis represents the bugs per SLOC found for the corresponding Java classes. As can be seen from the four trend lines in the chart, as the number of developers working on a single Java class increases, the number of bugs per SLOC also increases. This shows that more developer collaboration on a single Java class increases its probability of having bugs compared to classes that have been modified by fewer developers.

(4) Mean Bugs for Developer Distribution
This analysis calculated the mean number of bugs for the different developer groups. It was done to observe the pattern in the bugs logged for Java classes according to the number of developers who collaborated on each class. In figure 13, the X-axis shows the different developer groups, where 1 stands for the Java classes that have been worked on by a single developer. The Y-axis in turn shows the mean number of bugs logged for those files. As can be observed from the added trend lines, as the number of developers working on a file increases, the mean number of bugs also increases. Hence, this points

to the pattern that more collaboration on a single file leads to more bugs being logged for that file.

Figure 11: SLOC Ratio = SLOC of Developer Group / Total SLOC of project
Figure 12: Bugs per SLOC vs No. of unique developers for 30 projects (considering max of 10 developers)

(5) Number of Developers vs Lines of Code in a project and Project Age
This analysis compares the number of developers working on various projects with the density of source code in those projects. The X-axis in the figure 6 chart denotes the number of developers working on a project. The Y-axis denotes the lines of code in the project. There are two series in the chart. The orange series shows the SLOC for the different projects. The trend line for this series shows that as the number of developers working on a project increases, the net source lines of code tend to decrease. This was counter-intuitive, as we initially thought that more developers would result in

more code being written for the project. We used another dimension, the project age, which is represented by the grey series in the chart. It can be seen that as the number of developers on a project increases, the project age also increases. The cumulative insight from the two series is that as the number of developers and the project age increase while the lines of code decrease, more maintenance work is done on the project instead of new features being added.

Figure 13: Mean no. of Bugs vs No. of unique developers

6 RELATED WORK

In past years, many researchers have focused on finding the effect of developer collaboration on the quality of software. Caglayan et al. [2] investigated the evolution of the developer collaboration network over time during a release of a large-scale project. Moreover, Nagappan et al. [3] conducted a study in 2008 to find the relationship between the structure of the organization and the quality of the software. They conclude their study with a list of organizational metrics that should be considered when structuring teams and organizations in order to reduce the number of defects arising in the software and improve its overall quality.

7 THREATS TO VALIDITY

Our biggest threat to validity is the way bug fix commits are identified. We iterated over the commit tree of a repository, searched each commit for the keywords bug, issue, fix, close or error, and marked matching commits as bug fix commits to be used in further analysis. We deliberately chose these words as heuristics because we wanted to capture the issues that developers continuously face in an ongoing development process. However, such a choice poses a threat of overestimation, as the descriptiveness of commit logs varies across projects.
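The keyword heuristic described above can be sketched as follows. The source lists the keywords but not the exact matching rule, so the case-insensitive word-prefix match used here is an assumption, and the `commits` input format and sample messages are hypothetical.

```python
import re

# Keywords listed in the text. Matching them case-insensitively at the start
# of a word (so "fixes" and "closed" also match) is an assumption.
BUG_KEYWORDS = ("bug", "issue", "fix", "close", "error")
BUG_PATTERN = re.compile(r"\b(?:" + "|".join(BUG_KEYWORDS) + r")", re.IGNORECASE)


def is_bug_fix(message: str) -> bool:
    """Return True if the commit message contains one of the bug keywords."""
    return BUG_PATTERN.search(message) is not None


def count_bug_fixes_per_file(commits):
    """Count bug fix commits per file.

    commits: iterable of (message, list of changed file paths) pairs.
    """
    counts = {}
    for message, paths in commits:
        if is_bug_fix(message):
            for path in paths:
                counts[path] = counts.get(path, 0) + 1
    return counts


# Illustrative commit data only.
commits = [
    ("Fix NPE in parser", ["Parser.java"]),
    ("Add streaming API", ["Stream.java", "Parser.java"]),
    ("Closes #12: error on empty input", ["Reader.java"]),
    ("Update prefix table", ["Table.java"]),
]
bug_counts = count_bug_fixes_per_file(commits)
```

A message such as "Close stream in finally block" would also be flagged by this rule even though it may not describe a defect, which is one way the overestimation mentioned above can arise.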
Moreover, the source lines of code (SLOC) considered in our analysis refer to the line count at the head of the git repository and not at the commit level. Also, the total number of developers in a project is calculated on the basis of the distinct committers that have contributed to a branch. A repository may have only a few committers, which may not represent the actual number of developers or the developers that worked on a particular file. For example, project CoreNLP has a total of only 2 committers while it has a total SLOC of 759,702. It is highly unlikely that only 2 developers contributed this much code, and hence it is possible that other developers worked on the project but the final commits to the master branch were made by only 2 individuals. Also, we consider only the master branch of the software repositories and not all branches, which may lead to bias in the dataset.

Moreover, our work might also be prone to overestimation due to forks, since we took Java projects from only one data source, GitHub, and projects on GitHub are easy to fork. This may lead to an increase in very similar projects in our dataset. However, we manually analyzed each repository to check whether a fork of it had already been chosen for analysis, and hence, to the best of our knowledge, there are no duplicate repositories in the dataset. The repositories selected from GitHub have at least 2,000 commits and are relatively stable, and hence the results published in this work cannot be generalized to repositories with fewer than 2,000 commits (younger repositories) or to repositories that are developed in a professional setting in companies with strict code commit rules.
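The committer-count caveat above can be made concrete with a small sketch. Git records an author and a committer separately for every commit, so counting distinct committers can undercount the people who wrote the code. The input format below is an assumption: one line per commit with the author name and committer name separated by a tab, as could be produced with `git log --format=%an%x09%cn`. The names in the sample are made up, and whether this distinction explains the CoreNLP case is not established by the source.

```python
def distinct_people(log_lines):
    """Return (authors, committers) as sets of names.

    log_lines: iterable of strings, each holding an author name and a
    committer name separated by a single tab character (assumed format).
    """
    authors, committers = set(), set()
    for line in log_lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        author, committer = line.split("\t", 1)
        authors.add(author)
        committers.add(committer)
    return authors, committers


# Illustrative log: three contributors whose patches were all committed
# by a single maintainer.
sample_log = [
    "alice\tmaintainer",
    "bob\tmaintainer",
    "carol\tmaintainer",
    "maintainer\tmaintainer",
]
authors, committers = distinct_people(sample_log)
```

In this made-up log the committer count is 1 while the author count is 4, so a developer count based on committers alone would miss three contributors.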

8 FUTURE WORK AND CONCLUSION

This work can be extended to analyze more diverse repositories from all kinds of sources and ages. Since this work analyzed Java repositories alone, it would be interesting to see the results for other languages such as Python, C++, etc. We also consider open source projects from GitHub only. Projects from other version control systems such as SVN may add diversity to the dataset and may yield more general results. As explained in the previous section, equating the number of developers with the number of committers is troublesome, as is visible in the case of CoreNLP. A better mechanism for finding the actual number of developers may help in giving a real picture of the impact of developer collaboration on software quality.

As part of this project, we analyzed 50 open source Java projects (see Appendix) with varied project characteristics and inferred the impact of developer collaboration on the bug proneness of the source code. We converted the data from each individual project into a JSON object using python git. By parsing and analyzing those objects, we were able to determine that the major chunk of the source code was added by three or fewer developers. Another observation from the analysis is that higher collaboration in a source file leads to more errors being logged for that file. In addition, we observed that as the project age increased along with the number of developers, the source code density, i.e. SLOC, decreased, which points to the inference that there was more maintenance and support activity than new feature implementation.

9 ACKNOWLEDGMENTS

Many thanks to Professor Michael Godfrey for his invaluable comments and feedback on the research methodology and data analysis.
The authors would also like to thank the University of Waterloo for providing the computing resources and related infrastructure that allowed us to run our tool in parallel on multiple systems.

10 APPENDIX

Figure 14 shows the details of the dataset that we used. It includes the links to the project repositories and the bug characteristics of those projects.

Figure 14: Characteristics of the projects used in this research

REFERENCES

[1] J. Frederick P. Brooks, The mythical man month: Essays on software engineering.
[2] A. M. Bora Caglayan, Ayse Basar Bener, Emergence of developer teams in the collaboration network.
[3] N. Nagappan, E. M. Maximilien, T. Bhat, and L. Williams, Realizing quality improvement through test driven development: Results and experiences of four industrial teams, Empirical Softw. Engg., vol. 13, June.
[4] B. Çaglayan and A. B. Bener, Effect of developer collaboration activity on software quality in two large scale projects, Journal of Systems and Software, vol. 118.
[5] S. Alhassan, B. Caglayan, and A. Bener, Do more people make the code more defect prone?: Social network analysis in OSS projects.
[6] A. Meneely, L. Williams, W. Snipes, and J. Osborne, Predicting failures with developer networks and social network analysis, in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering, SIGSOFT '08/FSE-16, (New York, NY, USA), ACM.
[7] M. Pinzger, N. Nagappan, and B. Murphy, Can developer-module networks predict failures?, in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering, SIGSOFT '08/FSE-16, (New York, NY, USA), pp. 2-12, ACM.
[8] G. Madey, V. Freeh, and R. Tynan, The open source software development phenomenon: An analysis based on social network theory.
[9] T. Zimmermann and N. Nagappan, Predicting defects with program dependencies, in 3rd International Symposium on Empirical Software Engineering and Measurement, Oct.
[10] E. J. Weyuker, T. J. Ostrand, and R. M. Bell, Using developer information as a factor for fault prediction, in Predictor Models in Software Engineering, PROMISE '07: ICSE Workshops International Workshop on, pp. 8-8, May.
[11] F. Eichinger, K. Böhm, and M. Huber, Mining edge-weighted call graphs to localise software bugs, in Machine Learning and Knowledge Discovery in Databases (W. Daelemans, B. Goethals, and K. Morik, eds.), (Berlin, Heidelberg), Springer Berlin Heidelberg, 2008.
