Towards understanding: A study of the SourceForge.net community using modeling and simulation

Yongqin Gao, Greg Madey
Department of Computer Science and Engineering
University of Notre Dame
ygao,gmadey@nd.edu

Keywords: Complex Network, Open Source Software Research, Agent Based Simulation, Verification and Validation

Abstract
The Open Source Software (OSS) development movement is a classic example of a social network; it is also a prototype of a complex evolving network. In this research, we used an iterated method to generate models depicting the evolution of this network. Then, using Java/Swarm, we simulated the evolution of the OSS developer and project network and calibrated the model against two years of empirical data collected from SourceForge. In the simulation, we implemented several different models to replicate the empirical data. Through these simulation iterations we were able to verify and validate our model of this network and to increase our understanding of the OSS developer collaboration network. As a result, we produced a fit model that replicates all of the measures we inspected in the empirical data set.

INTRODUCTION
Open source software, usually created by volunteer programmers dispersed worldwide, now competes with software developed by commercial firms. This is due in part to the risks that closed-source proprietary software poses to users. These risks may be summarized as systems taking too long to develop, costing too much, and not working very well when delivered. Open source software offers an opportunity to surmount these disadvantages of proprietary software. SourceForge is one of the most popular OSS hosting web sites, offering features such as bug tracking, project management, forums, mailing list distribution, CVS, and more. In January 7 it had almost .5 million registered users and thousands of registered projects. It is a good sample of the OSS community for studying the underlying mechanisms of OSS communities.
The collaboration relationships between the developers in the OSS community are the focus of our study. These relationships can be studied if we model them as a complex network. (Open source is a certification mark owned by the Open Source Initiative, http://www.opensource.org; SourceForge is at http://sourceforge.net.) Scientists and mathematicians have been studying such networks for some time. Unlike the traditional networks computer scientists study, randomness is dominant in this kind of network: there is no centralized control over the construction and evolution of the network, and every node determines which nodes to connect to itself based only on its local knowledge. In this paper, we will use different models for complex networks, including random networks [6], small-world networks [13], and scale-free networks [3]. Finally, we will propose a modified scale-free network model to fit the OSS community. Previous research involving the simulation of complex networks is based on snapshots [,, ], which focus on the topology of the network at a given point in time. Since we collected data over a two-year period, we were able to study the evolution of the network. This helped us understand the evolution of the network topology, the development patterns of any single object in the network, and the impact of the interactions among objects on the evolution of the overall network system. Preliminary reports on this research have been published in the proceedings of NAACSOS [8, 9] and ADS [7]. The remainder of this paper is structured as follows. In Section 2, we explain our approach to modeling and simulation, including the methodology and experiment setup. In Section 3, we discuss the model we generated to depict the SourceForge community. The results and analysis of the simulation we built for the SourceForge community, together with its verification and validation, are provided in Section 4. Finally, we conclude in Section 5.
OUR APPROACH
Simulation Methodology
We use the same simulation procedure proposed in our previous study [, ], an iterated method for modeling and simulation. The main goal of the iterations is to obtain a fit model that describes the evolution of a complex network through simulation. Verification and validation is the most important step in the simulation iterations []. By comparing against the empirical data set repeatedly, verification makes the model converge toward the empirical system. There are two important components in the verification step: measures and methods. The measures are the criteria for the verification and the methods are how
SpringSim Vol. 5 ISBN -56555--6
the verification is carried out. In our study, we used both network and statistical measures:

1. Network measures include degree distribution, average diameter, average betweenness, and so on. These measures describe the characteristics of the network structure and its evolution pattern.

2. Statistical measures include monthly page views, downloads, number of developers in a group, and so on. These measures describe the characteristics of individual entities in the network and their evolution patterns.

For the verification methods, we used two: visual inspection and statistical hypothesis testing.

Experiment Environment
Our simulations are executed on a small computer cluster, which includes multiple simulation servers and duplicate database servers. These machines are connected by a switched Mbps LAN (Local Area Network). The simulation servers, which are used to run the simulations, are all dual Xeon .6GHz machines with G of memory and G of swap. The database servers host the repository database and the working database. One (the repository database server) is a dual Xeon .6GHz with G of memory, 6G of swap, and T of storage; the other (the working database server) is a dual Xeon .6GHz with G of memory and G of swap. The operating system throughout the cluster is Linux with the ..-.elsmp kernel; the database servers run PostgreSQL 7.

The simulation toolkit we used was Swarm and the programming language was Java.

MODEL DESCRIPTION
The model for simulation is essentially a procedure definition with related parameters. We use the same procedure definition as our previous study [7]. In the model, there are two kinds of entities: developers and projects. Developers, the active roles in the real SourceForge network, are also the active entities (agents) in our model. Projects are stored in the simulation as an accessory data structure.
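As a data-structure sketch of these two entity kinds (the field and method names here are our illustration, not the paper's actual implementation), the passive project record and the active developer agent might look like:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the two entity kinds: developers are active agents, projects
// are passive accessory records. Names are illustrative, not the paper's.
public class Entities {
    static class Project {
        final int id;
        double fitness;                          // fitness drawn at creation
        final List<Developer> members = new ArrayList<>();
        Project(int id, double fitness) { this.id = id; this.fitness = fitness; }
        int degree() { return members.size(); }  // number of participating developers
    }

    static class Developer {
        final int id;
        final List<Project> projects = new ArrayList<>();
        Developer(int id) { this.id = id; }
        // Joining adds the developer-project link in both directions.
        void join(Project p) { projects.add(p); p.members.add(this); }
    }
}
```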
Our model is a time-step-based simulation: every agent in the simulation is triggered to act at every time step. We define one time step in our model as one day in the real world. At every time step, a given number of new developers, N(ND), is generated and added to the simulation. In our model, the actual number of new developers at each time step is a random number with mean N(ND). Then every developer in the simulation (existing and new) is triggered to act, one by one. The procedure definitions for new developers and existing developers differ, so we discuss them separately.

New developer. Every new developer is triggered to act right after it is generated. There are two possible actions for a new developer: creating and joining. Creating means that the developer creates a new project, which adds a link from this developer to a new project. Joining means that the developer selects an existing project according to some rules, which adds a link from this developer to that project. P(CN) and P(JN) are the two parameters that control the probabilities of creating and joining. The one exception is that the very first developer in the simulation has to create a new project, because no projects exist in the system yet.

Existing developer. Every existing developer is triggered to act at every time step. There are four possible actions for an existing developer: creating, joining, dropping, and idling. Creating and joining are defined as for new developers. Dropping means that the developer drops a randomly selected project in which he had participated. Idling means that the developer does nothing, which is normally the most frequent action a developer takes. P(CO), P(JO), P(DO), and P(IO) are the four parameters that control the probabilities of creating, joining, dropping, and idling.
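The per-time-step action choice for an existing developer can be sketched as follows. This is a minimal illustration of the skeleton procedure, assuming hypothetical placeholder values for P(CO), P(JO), and P(DO); the paper's calibrated parameters are not reproduced here, and idling takes the remaining probability mass.

```java
import java.util.Random;

// Sketch of the per-time-step action choice for an existing developer.
// The probability values are illustrative placeholders, not the paper's
// calibrated parameters P(CO), P(JO), P(DO), P(IO).
public class DeveloperStep {
    enum Action { CREATE, JOIN, DROP, IDLE }

    static final double P_CREATE = 0.0005; // hypothetical P(CO)
    static final double P_JOIN   = 0.0010; // hypothetical P(JO)
    static final double P_DROP   = 0.0005; // hypothetical P(DO)

    // Map a uniform draw u in [0,1) to an action via cumulative thresholds;
    // everything above the three thresholds is idling, the most frequent action.
    static Action chooseAction(double u) {
        if (u < P_CREATE) return Action.CREATE;
        if (u < P_CREATE + P_JOIN) return Action.JOIN;
        if (u < P_CREATE + P_JOIN + P_DROP) return Action.DROP;
        return Action.IDLE;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        int idle = 0;
        for (int i = 0; i < 10000; i++) {
            if (chooseAction(rng.nextDouble()) == Action.IDLE) idle++;
        }
        System.out.println("idle fraction: " + idle / 10000.0);
    }
}
```

A new developer would use the same thresholding but with only the creating/joining branch (P(CN), P(JN)).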
The probability that a developer chooses a certain action differs across simulation models, depending on the attributes each model considers. This is only the skeleton of the procedure definitions in our models. Different simulation models may have different procedure definitions. Most of these differences concern the rules a developer uses to choose an action and the rules a developer uses to decide which project to join. The related parameters may also differ between simulation iterations. All the procedure definitions are based on the skeleton procedure described in this section. We will explain the detailed procedure definitions and related parameters for every model when we discuss that specific model.

RESULTS AND DISCUSSIONS
In our previous study [, ], we iterated through an adaptive ER model, a BA model, a BA model with constant fitness, and a BA model with dynamic fitness. As the starting point of the current study (model one), we improved the last model of the previous study, the BA model with dynamic fitness, in the following aspects to better match the real SourceForge.net community.
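For the rule that decides which project to join, one possible reading (used here for illustration only; the exact weighting in the paper's models may differ) is fitness-biased preferential attachment, where a project is chosen with probability proportional to its degree times its fitness:

```java
import java.util.Random;

// Sketch of fitness-biased preferential attachment as one possible
// project-selection rule: project i is chosen with probability
// proportional to degree_i * fitness_i. This weighting is our
// illustrative assumption, not necessarily the paper's exact rule.
public class ProjectChoice {
    // Roulette-wheel selection over weights w_i = degree_i * fitness_i.
    static int pickProject(int[] degree, double[] fitness, Random rng) {
        double total = 0.0;
        for (int i = 0; i < degree.length; i++) {
            total += degree[i] * fitness[i];
        }
        double r = rng.nextDouble() * total;
        double cum = 0.0;
        for (int i = 0; i < degree.length; i++) {
            cum += degree[i] * fitness[i];
            if (r < cum) return i;
        }
        return degree.length - 1; // guard against rounding at the boundary
    }
}
```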
The first improvement concerns the stochastic procedures. Since the networks we are simulating are complex networks, we utilize multiple stochastic procedures in the model to replicate their complex characteristics. Our models contain the following stochastic procedures: generating new developers at every time step; generating fitness for new projects; picking an action at every time step for every user; and picking a project to join. The latter two stochastic procedures are well defined by the developer preferential attachment and project preferential attachment discussed in the previous study. The first two stochastic procedures, however, were not well defined in the previous models; redefining them is the first improvement in our first iterated model. In the previous study, we assumed that these two stochastic procedures follow a discrete uniform distribution. However, this is not a reasonable assumption for real networks like the SourceForge.net community network, so we replace the underlying distributions for these two stochastic procedures. For the first stochastic procedure, generating new developers at every time step, the Poisson distribution is a reasonable approximation for the underlying distribution. It expresses the probability of a number of events occurring in a fixed period of time if these events occur with a known average rate and independently of the time since the last event. The probability mass function of the Poisson distribution is

f(k; λ) = λ^k e^(-λ) / k!   (1)

where e is the base of the natural logarithm, k! is the factorial of k, and λ is the expected number of occurrences during the given interval. So we replaced the underlying distribution for the random number generator in the first stochastic procedure with a Poisson distribution whose average λ is the average we observed in the empirical data set.
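Drawing the per-step number of new developers from this distribution can be sketched with Knuth's multiplicative Poisson sampler; the value of λ below is a placeholder for the empirical average, not the calibrated value:

```java
import java.util.Random;

// Sketch of the redefined first stochastic procedure: the number of new
// developers per time step is drawn from a Poisson distribution using
// Knuth's multiplicative method. lambda is a placeholder for the average
// observed in the empirical data set.
public class NewDeveloperArrivals {
    static int samplePoisson(double lambda, Random rng) {
        double limit = Math.exp(-lambda);
        double product = rng.nextDouble();
        int k = 0;
        // Multiply uniform draws until the product falls below e^(-lambda);
        // the number of extra draws needed is Poisson(lambda)-distributed.
        while (product > limit) {
            product *= rng.nextDouble();
            k++;
        }
        return k;
    }
}
```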
For the second stochastic procedure, the initial fitness of a new project, we used the log-normal distribution as the underlying distribution. Its probability density function is

f(x; µ, σ) = (1 / (xσ√(2π))) e^(-(ln x - µ)² / (2σ²))   (2)

for x > 0, where µ and σ are the mean and standard deviation of the variable's logarithm. The actual mean and standard deviation used in our model are chosen according to the empirical data set to achieve a good fit.

The second improvement is applying a periodic update procedure for the project list. By applying these two improvements, we obtained the first model of this study. As discussed before, we verified this model by comparing the network measures calculated from the simulated data set with the network measures calculated from the empirical data set. Our simulation starts from the very beginning of the network (the first node) and runs up to three simulated virtual years. The empirical data set starts from January and continues intermittently after that. The comparison is taken over the overlapping time period, reflecting January to December in the empirical data and simulated month to simulated month 6 in the simulation. This and other data from SourceForge.net is available for scholarly research from the authors [, ]. In addition to the visual inspection used in the previous study, we also use statistical hypothesis tests to verify the match of the network measures. There are two kinds of statistical hypothesis tests: parametric and non-parametric. In our study, we cannot assume that the data set is normally distributed, so we use non-parametric hypothesis tests to verify the similarity of the measures from the simulation and the empirical data set. One of the tests we used is the chi-squared (χ²) test for categorized data.
The statistic χ² is calculated by finding the difference between each observed and expected value, squaring it, dividing by the expected value, and summing the results:

χ² = Σ_{i=1}^{n} (O_i - E_i)² / E_i   (3)

where O_i is an observed value and E_i is an expected (empirical) value. We use the χ² test to verify network measures such as average degrees, approximate diameter, approximate clustering coefficient, average betweenness, and average closeness. The other test we used is the Kolmogorov-Smirnov (K-S) test, which determines whether the underlying distributions of two data sets differ, or whether an underlying distribution differs from a hypothesized distribution, in either case based on finite samples. The two one-sided Kolmogorov-Smirnov test statistics are given by

D+_n = max_x (F_n(x) - F(x))   (4)
D-_n = max_x (F(x) - F_n(x))   (5)

Details about the K-S test can be found in [5]. In our study, we used the K-S test to verify that the degree distributions in the simulated data set match the corresponding distributions in the empirical data set.

The first network measure we inspected is the average degree and its development through the time period of interest. The results are shown in Figure 1. The X coordinate is time; the time range is [,], reflecting the range [January, December] in the empirical data set and the range [month, month 6] for the simulated data set. The Y coordinate is the average degree value. All the
four average degrees yield similar data points in the empirical data and the simulated data.

Figure 1. Average Degrees Comparison (Developer in C NET, Project in C NET, Developer in D NET, Project in P NET). x represents the empirical data and + represents the simulated data.

We also verified this using a statistical hypothesis test, applying the χ² test to all four average degrees. The χ² values for the four average degrees are ., .7, ., and .7, respectively. Choosing significance level α = . and df = , the critical value is .66. These values are all well below the critical value. Thus the null hypotheses are not rejected, meaning the average degrees in the simulated data set show no statistically significant difference from the average degrees in the empirical data set.

Next, we verify the simulation results using the approximate diameter and approximate clustering coefficient. Figure 2 shows the development of the diameter and of the clustering coefficient. By visual inspection, the development of the diameter and clustering coefficient in the simulated data is similar to that in the empirical data. We also used the χ² test to verify the similarity of the developments in the simulated data and in the empirical data, with α = . and df = . The χ² values for the diameter development and the clustering coefficient development are .6 and .e-, well below the critical value of .66. So the null hypotheses are not rejected, and the simulated data yields diameter and clustering coefficient development trends similar to the empirical data.

The next network measures are the betweenness and closeness, shown in Figure 3. Using the χ² test, we found the χ² values for the betweenness development and the closeness development to be 5.8e-8 and .6e-9, well below the critical value of .66 (α = ., df = ).
So the null hypotheses are not rejected and, thus, the simulated data yields similar average betweenness and average closeness development trends as the. 6.6.7 Figure. Diameter and Clustering Coefficient Comparison. x represents the and + represents the simulated data Average Btweenness.69 x.68.67.66.65.6.6.6.6 Betweenness.6 Average Closeness.6 x 6.5.... Closeness. Figure. Betweenness and Closeness Comparison. x represents the and + represents the simulated data ISBN -56555--6 8 SpringSim Vol.
Figure 4. Degree Distributions Comparison (developer distribution and project distribution, on linear and log scales). x represents the empirical data and o represents the simulated data.

Finally, we also verified the simulation model using the degree distributions, shown in Figure 4. We used the K-S test to test whether the degree distributions in the simulated data set are the same as the degree distributions in the empirical data set. The statistics for the developer degree distribution and project degree distribution are .75 and .979. At the common significance level of .5, the hypotheses are not rejected. Thus both the developer degree distribution and the project degree distribution in the simulated data match the corresponding distributions in the empirical data. However, by visual inspection we can still find differences between the degree distributions of the empirical data set and the simulated data set. For the developer degree distribution, the simulated data set has a significantly shorter tail than the empirical data. For the project degree distribution, the simulated data set has a significantly longer tail than the empirical data. These deficiencies (the busiest developer in the simulated data set participated in just projects instead of 5 in the empirical data, and the biggest project in the simulated data set has more than 5 developers instead of 95 in the empirical data) are not reasonable for a real-world community such as SourceForge.net. So we iterated further models to make up for these deficiencies.

The second iterated model attempts to make up for the deficiencies discovered in the previous model. To achieve this, we added a new parameter to the model: user energy. In the previous models, we introduced a fitness factor into the project preference; user preference, on the other hand, was based only on the number of projects a developer participated in.
The deficiency in the previous model could be the result of the lack of user fitness, so in this model we introduced a fitness factor for user preference, called user energy. We define the energy as a constant parameter γ_i for every user, where γ_i is chosen from a continuous uniform distribution on (, ]. This energy level can be described as the probability that a user will take an action instead of idling during each time step. This model yielded results similar to the previous models for most of the network measures, except the degree distributions. The biggest project in the project distribution decreased from more than 5 to below 5. However, the developer distribution still has a significant deficiency at the tail compared to the same measure in the empirical data.

A further modeling iteration introduced dynamic user energy into the model. In this model, the energy decays linearly over time and, meanwhile, is self-adjusted as the user takes on, or gives up, roles and responsibilities in the community. With this dynamic energy, we expected to capture the details of a user's possible influence on related projects, and even the whole community, and thereby generate better matches in both the developer distribution and the project distribution. The simulation results of this model improve the matches of the developer distribution and the project distribution. The degree distributions of this model are similar to those of the previous models, except that the busiest user increased to projects, which is closer to the empirical data, while the size of the largest project stays below . This is a more reasonable tail for SourceForge.net and closer to the empirical data set. Thus this model is able to replicate all the network measures used in the verification, by both visual inspection and statistical hypothesis testing.
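The paper specifies linear decay with self-adjustment but not the exact constants; as one hypothetical sketch (the decay rate and role adjustment below are our placeholders, not the authors' values), the dynamic user energy might be maintained like this:

```java
// Hypothetical sketch of the dynamic user energy described above: linear
// decay per time step plus self-adjustment for taking or giving up roles.
// The constants are illustrative placeholders, not the paper's values.
public class UserEnergy {
    static final double DECAY_PER_STEP = 0.001; // illustrative decay rate
    static final double ROLE_BOOST = 0.05;      // illustrative adjustment

    private double energy; // probability of acting (vs. idling) this step

    UserEnergy(double initial) { this.energy = initial; }

    // Linear decay at every time step, floored at zero.
    void step() { energy = Math.max(0.0, energy - DECAY_PER_STEP); }

    // Self-adjustment: taking a role raises energy, giving one up lowers it,
    // clamped to [0, 1].
    void takeRole() { energy = Math.min(1.0, energy + ROLE_BOOST); }
    void giveUpRole() { energy = Math.max(0.0, energy - ROLE_BOOST); }

    double value() { return energy; }
}
```

Under a rule like this, inactive users fade out of the network while users who accumulate roles stay active, which is one way to lengthen the developer-distribution tail while capping project sizes.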
After analyzing and iterating the models, our third model reproduced all the inspected network measures we chose from the empirical data and captured all the phenomena we wanted to highlight in the community; it can be used to simulate the community and possibly predict the future of the SourceForge.net community by extending the simulations.

CONCLUSION
In this study, we used simulation methods to assist in understanding the topology and evolution of the SourceForge.net developer collaboration network. We used an iterated method to develop further models for the SourceForge.net community, iterating three more models beyond our previous study to generate the virtual SourceForge.net. In each simulation iteration, we verified the results by comparing the measures from the simulated data with the measures from the empirical data, using both visual inspection and statistical hypothesis tests. The final result of the study is a model that simulates SourceForge.net with matched network measures. Using this model, we are able to simulate the SourceForge.net community and possibly
predict the future development of the community by simply extending the simulation time.

Acknowledgment. This work was supported in part by the National Science Foundation, CISE/IIS-Digital Society & Technology, under Grant No. 89 [, ].

REFERENCES
[1] OSS research (http://www.nd.edu/~oss). 2006.
[2] OSS research portal (http://zerlot.cse.nd.edu). 2006.
[3] R. Albert and A.-L. Barabási. Emergence of scaling in random networks. Science, 286:509-512, 1999.
[4] R. Bellman. Adaptive Control Processes: A Guided Tour. Princeton University Press, Princeton, NJ, 1961.
[5] I. M. Chakravarti, R. G. Laha, and J. Roy. Handbook of Methods of Applied Statistics, Volume I. John Wiley and Sons, 1967.
[6] P. Erdös and A. Rényi. On random graphs. Publicationes Mathematicae, 6:290-297, 1959.
[7] Y. Gao, G. Madey, and V. Freeh. Modeling and simulation of the open source software community. ADS 2005, San Diego, 2005.
[8] Y. Gao, V. Freeh, and G. Madey. Analysis and modeling of the open source software community. NAACSOS, Pittsburgh.
[9] Y. Gao, V. Freeh, and G. Madey. Conceptual framework for agent-based modeling and simulation. NAACSOS, Pittsburgh.
[10] G. Madey, V. Freeh, and R. Tynan. Agent-based modeling of open source using Swarm. Eighth Americas Conference on Information Systems, 2002.
[11] G. Madey, V. Freeh, and R. Tynan. The open source software development phenomenon: An analysis based on social network theory. Eighth Americas Conference on Information Systems, 2002.
[12] M. E. J. Newman and D. J. Watts. Random graph models of social networks. Physical Review, 6(68).
[13] D. J. Watts and S. H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393:440-442, 1998.