Towardsunderstanding: Astudy ofthe SourceForge.net community using modeling and simulation

Similar documents
An Investigation into the Free/Open Source Software Phenomenon using Data Mining, Social Network Theory, and Agent-Based

Network Analysis of the SourceForge.net Community

γ : constant Goett 2 P(k) = k γ k : degree

(Social) Networks Analysis III. Prof. Dr. Daning Hu Department of Informatics University of Zurich

An Evolving Network Model With Local-World Structure

Erdős-Rényi Model for network formation

Constructing a G(N, p) Network

Constructing a G(N, p) Network

Properties of Biological Networks

Complex Networks. Structure and Dynamics

Overlay (and P2P) Networks

Wednesday, March 8, Complex Networks. Presenter: Jirakhom Ruttanavakul. CS 790R, University of Nevada, Reno

Critical Phenomena in Complex Networks

CSCI5070 Advanced Topics in Social Computing

A generative model for feedback networks

Exercise set #2 (29 pts)

CAIM: Cerca i Anàlisi d Informació Massiva

Detection of Resource-Drained Attacks on SIP-Based Wireless VoIP Networks

How Do Real Networks Look? Networked Life NETS 112 Fall 2014 Prof. Michael Kearns

Models of Network Formation. Networked Life NETS 112 Fall 2017 Prof. Michael Kearns

Constructing Numerical Reference Modes from Sparse Time Series

M.E.J. Newman: Models of the Small World

A Generating Function Approach to Analyze Random Graphs

Use of Extreme Value Statistics in Modeling Biometric Systems

Modeling and Performance Analysis with Discrete-Event Simulation

Empirical Study on Impact of Developer Collaboration on Source Code

Graph Sampling Approach for Reducing. Computational Complexity of. Large-Scale Social Network

Lesson 4. Random graphs. Sergio Barbarossa. UPC - Barcelona - July 2008

Bootstrapping Methods

arxiv: v1 [cs.si] 11 Jan 2019

Applying Social Network Analysis to the Information in CVS Repositories

HEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A PHARMACEUTICAL MANUFACTURING LABORATORY

Analytic Performance Models for Bounded Queueing Systems

Microscopic Traffic Simulation

Analysis of the Social Community Based on the Network Growing Model in Open Source Software Community

Typeset with NdThesiS version 2.14 (2000/09/08) on November 17, 2003 THIS PAGE IS NOT PART OF THE THESIS, BUT SHOULD BE TURNED IN TO THE PROOFREADER!

Slides 11: Verification and Validation Models

Chapter 16. Microscopic Traffic Simulation Overview Traffic Simulation Models

Higher order clustering coecients in Barabasi Albert networks

Excel 2010 with XLSTAT

An Empirical Analysis of Communities in Real-World Networks

Analytical reasoning task reveals limits of social learning in networks

Package fastnet. September 11, 2018

Introduction to Networks and Business Intelligence

6.207/14.15: Networks Lecture 5: Generalized Random Graphs and Small-World Model

Clustering: Classic Methods and Modern Views

Specific Communication Network Measure Distribution Estimation

Package fastnet. February 12, 2018

ARMA MODEL SELECTION USING PARTICLE SWARM OPTIMIZATION AND AIC CRITERIA. Mark S. Voss a b. and Xin Feng.

Introduction to the Special Issue on AI & Networks

TELCOM2125: Network Science and Analysis

On Reshaping of Clustering Coefficients in Degreebased Topology Generators

A COMPUTER-AIDED SIMULATION ANALYSIS TOOL FOR SIMAN MODELS AUTOMATICALLY GENERATED FROM PETRI NETS

Note Set 4: Finite Mixture Models and the EM Algorithm

Graph Model Selection using Maximum Likelihood

Using the Kolmogorov-Smirnov Test for Image Segmentation

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS

Supply chains involve complex webs of interactions among suppliers, manufacturers,

Distributed Detection in Sensor Networks: Connectivity Graph and Small World Networks

Random Generation of the Social Network with Several Communities

CLUSTERING. JELENA JOVANOVIĆ Web:

Summary: What We Have Learned So Far

The importance of networks permeates

Nonparametric Estimation of Distribution Function using Bezier Curve

Lecture: Simulation. of Manufacturing Systems. Sivakumar AI. Simulation. SMA6304 M2 ---Factory Planning and scheduling. Simulation - A Predictive Tool

THE EFFECT OF SEGREGATION IN NON- REPEATED PRISONER'S DILEMMA

CSE 190 Lecture 16. Data Mining and Predictive Analytics. Small-world phenomena

UNIVERSITA DEGLI STUDI DI CATANIA FACOLTA DI INGEGNERIA

Open Source Software Developer and Project Networks

CS-E5740. Complex Networks. Scale-free networks

You ve already read basics of simulation now I will be taking up method of simulation, that is Random Number Generation

Probability Models.S4 Simulating Random Variables

Small World Properties Generated by a New Algorithm Under Same Degree of All Nodes

On Complex Dynamical Networks. G. Ron Chen Centre for Chaos Control and Synchronization City University of Hong Kong

The Simulation Study of Spread and Evolution about Network Opinion on Education

Descriptive Statistics, Standard Deviation and Standard Error

Structural Analysis of Paper Citation and Co-Authorship Networks using Network Analysis Techniques

The missing links in the BGP-based AS connectivity maps

A Comparison of Evaluation Networks and Collaboration Networks in Open Source Software Communities

Genetic Algorithms with Oracle for the Traveling Salesman Problem

arxiv: v1 [physics.soc-ph] 13 Jan 2011

Stable Statistics of the Blogograph

Complex networks: A mixture of power-law and Weibull distributions

Metaheuristic Development Methodology. Fall 2009 Instructor: Dr. Masoud Yaghini

Models for the growth of the Web

Graph Types. Peter M. Kogge. Graphs Types. Types of Graphs. Graphs: Sets (V,E) where E= {(u,v)}

A Statistical Method for Synthetic Power Grid Generation based on the U.S. Western Interconnection

Topologies and Centralities of Replied Networks on Bulletin Board Systems

Dynamic network generative model

Estimating Local Decision-Making Behavior in Complex Evolutionary Systems

Attack Vulnerability of Network with Duplication-Divergence Mechanism

Spectral Methods for Network Community Detection and Graph Partitioning

Statistical Analysis of List Experiments

Internet as a Complex Network. Guanrong Chen City University of Hong Kong

Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011

Rethinking Preferential Attachment Scheme: Degree centrality versus closeness centrality 1

SIMULTANEOUS COMPUTATION OF MODEL ORDER AND PARAMETER ESTIMATION FOR ARX MODEL BASED ON MULTI- SWARM PARTICLE SWARM OPTIMIZATION

Developing Effect Sizes for Non-Normal Data in Two-Sample Comparison Studies

ENTITIES WITH COMBINED DISCRETE-CONTINUOUS ATTRIBUTES IN DISCRETE- EVENT-DRIVEN SYSTEMS

Sequences Modeling and Analysis Based on Complex Network

Transcription:

Towardsunderstanding: Astudy ofthe SourceForge.net community using modeling and simulation YongqinGao,GregMadey DepartmentofComputerScience and Engineering UniversityofNotreDame ygao,gmadey@nd.edu Keywords: Complex Network, Open Source Software Research, Agent Based Simulation, Verification and Validation Abstract The Open Source Software (OSS) development movement is a classic example of a social network; also it is a prototype of a complex evolving network. In this research, we used a iterated method to generate models depicting the evolution of this network. Then, using Java/Swarm, we simulated in our model the evolution of the OSS developer and project network and calibrated the model using the two year empirical data we collected from SourceForge. In simulation, we implemented several different models to replicate the empirical data. From these simulation iterations we were able to verify and validate our model for this network and to increase our understanding of the OSS developer collaboration network. As a result, we generated a fit model which can replicate all measures we inspected in the set. INTRODUCTION Open source software, usually created by volunteer programmers dispersed worldwide, now competes with that developed by commercial software firms. This is due in part because closed-source proprietary software has associated risks to users. These risks may be summarized as systems taking too long to develop, costing too much and not working very well when delivered. Open source software offers an opportunity to surmount those disadvantage of proprietary software. SourceForge is one of the most popular OSS hosting web sites, which offers features like bug tracking, project management, forum service, mailing list distribution, CVS and more. In January 7 it had almost.5 millions registered users and almost thousands registered projects. It is a good sample of the OSS community to study the underlying mechanisms of OSS communities. The collaboration relationships between the developers in the OSS community are the focus of our study. These relationships can be studied if we model them as a complex network. Scientists and mathematicians have been studying such Open source is a certification mark owned by the Open Source Initiative(http://www.opensource.org) http://sourceforge.net networks for some time. Different from the traditional networks computer scientists study, randomness is dominant in this kind of network, since there is no centralized control of the construction and evolution of the network, every node in the network will determine what nodes to connect to itself based on the local knowledge it has. In this paper, we will use different network models for the complex networks, including random networks [6], Small world networks [], and Scale-free networks []. Finally, we will propose a modified scale-free network model to fit the OSS community. Previous research involving the simulation of complex networks is based on snapshots [,, ], which focused on the topology of the network at a given point in time. Since we collected data over a two year period, we were able to look into the evolution of the network. This helped us understand the evolution of the network topology, the development patterns of any single object in the network and the impact of the interaction among objects on the evolution of the overall network system. Preliminary reports on this research have been published in proceedings of NAACSOS [8, 9] and ADS [7]. The remainder of this paper is structured as follows. In section, we explain our approach to do the modeling and simulation including methodology and experiment setup. In Section, we discuss the model we generated to depict the SourceForge community. The result and analysis of the simulation we build for the SourceForge community and the verification and validation of the simulation are provided in Section. Finally, we conclude in Section 5. OUR APPROACH Simulation Methodology We will use the same simulation procedure we proposed in the previous study [, ], which is an iterated method to do modeling and simulation. The main goal of the iterations is to get a fit model to describe the evolution of a complex network by simulation. Verification and validation is the most important step in the simulation iterations []. By comparing against the empirical data set repeatly, verification can make the model converge to the empirical system. There are two important components in the verification step - measures and methods. The measures are the criterion for the verification and the methods are how SpringSim Vol. 5 ISBN -56555--6

the verification is carried out. In our study, we used both statistical and network measures:. Network measures includes degree distribution, average diameter, average betweenness and so on. These measures describe the characteristics of the network structure and its evolution pattern.. Statistical measures includes monthly pageviews, downloads, number of developers in the group and so on. These measures describe the characteristics of individual entity in the network and its evolution pattern. For the verification methods, we used two verification methods to verify - visual inspection and statistical hypothesis test. Experiment Environment Our simulations are executed on a small computer cluster, which includes multiple simulation servers and duplicate database servers. These machines were connected by a M(bps) switched LAN (Local Area Network). The simulation servers, which are used to run the simulations, are all dual Xeon.6GHz with G of memory and G of swap. The database servers are used to host the repository database and working database. One (the repository database server) is dual Xeon.6GHz with G of memory, 6G of swap and T storage and the other (the working database server) is dual Xeon.6GHz with G of memory and G of swap. The operating systems of the cluster are as follows: The simulation servers are Linux with..-.elsmp kernel. The database servers are Linux with..-.elsmp kernel with PostgreSQL 7.. The simulation toolkit we used was Swarm and the programming language we used was Java. MODEL DESCRIPTION The model for simulation is like a procedure definition with related parameters. We are using the same procedure definition as our previous study [7]. In the model, there were two kinds of entities developer and project. Developers, which are the active roles in real SourceForge network, will also be the active entities (agents) in our model. Projects will stored in the simulation as an accessory data structure. Our model is a time-step based simulation, which means that every agent in the simulation is triggered to act at every time step. We define the time step in our model as reflecting a single day in the real world. At every time step, a given number of new developers, N(ND), will be generated and added into the simulation. In our model, the actual number of new developers at every time step was a random number meaned at N(ND). Then every existing developer (existing and new developers) in the simulation was triggered to act one by one. The procedure definitions for new developer and existing developer were different, which we will discuss separately. New developer. Every new developer is triggered to act right after it is generated. There are two possible actions for a new developer creating or joining. Creating means that the developer would create a new project and thus add a link from this developer to a new project. Joining means that the developer would select an existing project according to some rules, and thus a link from this developer to the project is added. P(CN) and P(JN) are the two parameters that controlled the probability of creating and joining. The one exemption is that the very first developer in the simulation has to create a new project because there are no existing projects in the system yet. Existing developer. Every existing developer is triggered to act at every time step. There are four possible actions for an existing developer creating, joining, abandoning and idling. Creating and joining are similar to the definitions for the new developer. Dropping means that the developer would drop a randomly selected project in which he had participated. Idling means that the developer did nothing, which is normally the most frequent action a developer will take. P(CO), P(JO), P(DO) and P(IO) are the four parameters to control the probability of creating, joining, dropping and idling. The probabilities that a developer would choose a certain action are different in different models for simulations based on different considered attributes. This is only the skeleton of the procedure definitions in our models. Different simulation models may have different procedure definitions. Most of these differences are related to the rules a developer used to choose an action and the rules a developer used to decide which project to join. Also, the related parameters may be different for different simulation iterations. All the procedure definitions are based on the skeleton procedure described in this section. We will explain the detailed procedure definitions and related parameters for every model when we discuss that specific model. RESULTS AND DISCUSSIONS In our previous study [, ], we iterated through adaptive ER model, BA model, BA model with constant fitness and BA model with dynamic fitness. As the starting point of current study - model one, we improved the last model in the previous study - BA model with dynamic fitness in the following aspects to better match the real SourceForge.net community. ISBN -56555--6 6 SpringSim Vol.

The first improvement is on the stochastic procedures. Since the networks we are simulating are complex networks, we will utilize multiple stochastic procedures in the model to replicate these complex characteristics. In our models, we have the following stochastic procedures: Generating new developers every time step; Generating fitness for new projects; Picking action at every time step for every user; Picking a project to join. The latter two stochastic procedures are well defined by developer preferential attachment and project preferential attachment discussed in the previous study. On the other hand, the first two stochastic procedures are not well defined in the previous models. Redefinition of these two stochastic procedures will be the first improvement in our first iterated model. In previous study, we assume that these two stochastic procedures are based on discrete uniform distribution. However, this is not a reasonable assumption in real networks like SourceForge.net community networks. We will replace the underlying distributions for these two stochastic procedures. For the first stochastic procedure generating new developers every time step, Poisson distribution is a reasonable approximation for the underlying distribution. Normally, it expresses the probability of a number of events occurring in a fixed period of time if these events occur with a known average rate, and are independent of the time since the last event. The probability density function for the Poisson distribution is F(k;λ) = e λ λ k where e is the base of the natural logarithm, k! is the factorial of k and λ is the expected number of occurrences that occur during the given interval. So we replaced the underlying distribution for the random number generator in the first stochastic procedure with the Poisson distribution with average λ at the average we inspected in the. For the second stochastic procedure initial fitness for new project, we used log-normal distribution as the underlying distribution for the stochastic procedure. The probability density function is k! f (x;µ,σ) = xσ /σ e (lnx µ) π for x >, where µ and σ are the mean and standard deviation of the variable s logarithm. The actual mean and standard deviation used in our model are picked according to the set to achieve a good fit. The second improvement is applying a periodical update procedure for project list. By applying these two improvements, we got our first model in this study. As we discussed before, we verified this model by comparing the network measures calculated from the set with the network measures calculated () () from the set. Our simulation starts from the very beginning of the network (the first node) and up to three simulated virtual years. The set starts from January and intermittently after that. The comparison will be taken in the overlapped time period, reflecting January to December for and simulated month to simulated month 6 in the simulation. This and other data from SourceForge.net is available for scholarly research from the authors [, ]. In addition to visual inspection that we used in the previous study, we will also use statistical hypothesis test to verify the match of the network measures. There are two kinds of statistics hypothesis test - parametric hypothesis testing and non-parametric hypothesis test. In our study, we can not make the assumption that the data set is normally distributed, so we will use non-parametric hypothesis test to verify the similarity of the measures from the simulation and the set. One of the tests we used is Chi-squared χ test for categorized data. The statistics χ is calculated by finding the difference between each observed and expected value, squaring them, dividing each by the theoretical value, and taking the sum of the results: χ = n i= (O i E i ) E i () where O i is an observed value and E i is an expected(empirical) value. We will use the χ test on verifying the network measures like average degrees, approximate diameter, approximate clustering coefficient, average betweenness and average closeness. The other test we used is the Kolmogorov-Smirnov K- S test for determining whether two underlying distributions of two data sets differ, or whether an underlying distribution differs from a hypothesized distribution, in either case based on finite samples. The two one-sided Kolmogorov-Smirnov test statistics are given by D + n = max(f n (x) F(x)) () D n = max(f(x) F n(x)) (5) The details about K-S test can be found at [5]. In our study, we used the K-S test to verify the degree distributions in the simulated data set match the corresponding distributions in the. The first network measure we inspected is the average degree and their development through the interested time period. The results are shown in Figure. The X coordinate is the time. The time range is [,], which is reflecting time range [January, December ] in the set and time range [month, month 6] for the simulated data set. The Y coordinate is the average degree value. All the SpringSim Vol. 7 ISBN -56555--6

. Developer in C NET Project in C NET...8.6...5.5 8 7.8 Diameter.7.75 CC 9 8.5 8 7.5 Developer in D NET 7.8.6. Project in P NET. Figure. s Comparison. x represents the and + represents the simulated data Approximate Diameter 7.6 7. 7. 7 6.8 Approximate Clustering Coefficient.7.75.7.75.7.75 four average degrees yield similar data points in empirical data and simulated data. We also verified this by using statistical hypothesis test by applying the χ test for all the average degrees. The χ for these four average degrees are.,.7,. and.7 respectively. By choosing significance level α at. and the d f value at, the critical value is.66. These values are all well below the critical value. Thus the null hypotheses are not rejected, which means the average degrees in the simulated virtual data set have no statistically significant difference from the average degrees in the set. Next, we verify the simulation results using the approximate diameter and approximate clustering coefficient. Figure shows the diameter development and the clustering coefficient development. Using visual inspection, we found the developments of diameter and clustering coefficient from the simulated data is similar to those of the. We also used the χ test to verify the similarity of the developments in the simulated data and the developments in the with α as. and d f as. The χ value for the diameter development and clustering coefficient development are.6 and.e-, which are well below the critical value at.66. So the null hypotheses are not rejected and thus the simulated data yields similar diameter and clustering coefficient development trends as the. The next network measures are the betweenness and closeness. we show the measures in Figure. By using the χ test, we found the χ value for the betweenness development and the closeness development are 5.8e-8 and.6e-9, which are well below the critical value.66 using α as. and d f as. So the null hypotheses are not rejected and, thus, the simulated data yields similar average betweenness and average closeness development trends as the. 6.6.7 Figure. Diameter and Clustering Coefficient Comparison. x represents the and + represents the simulated data Average Btweenness.69 x.68.67.66.65.6.6.6.6 Betweenness.6 Average Closeness.6 x 6.5.... Closeness. Figure. Betweenness and Closeness Comparison. x represents the and + represents the simulated data ISBN -56555--6 8 SpringSim Vol.

Developers Developers(LOG) 6 x 5 5 Participated projects 6 5 Dev Dist in C NET.5.5 Participated projects(log) Projects Projects(LOG).5 x.5.5.5 Proj Dist in C NET 5 5 Participated Developers 6 5 Participated developers(log) Figure. Degree Distributions Comparison. x represents the and o represents the simulated data Finally, we also verified the simulation model using the degree distributions. The degree distributions are shown in Figure. We used the K-S test to test whether the degree distributions in the simulated data set are the same as the degree distributions in the. The statistics for the developer degree distribution and project degree distribution are.75 and.979. Using the common significance level at.5, the hypotheses are not rejected. Thus both the developer degree distribution and project degree distribution in the simulated data are the same as the corresponding distributions in the. However, by visual inspection, we can still find differences between the degree distributions of the set and the simulated data set. For the developer degree distribution, the simulated data set has a significantly shorter tail than the. For the project degree distribution, the simulated data set has a significantly longer tail than the empirical data. These deficiencies (the busiest developer in the simulated data set participated in just projects instead of 5 in the and the biggest project in the simulated data set has more than 5 developers instead of 95 in the ) are not reasonable for the real world community such as SourceForge.net. So we iterated more models to make these deficiencies. The second iterated model is trying to make up for the deficiencies discovered in the previous model. To achieve this, we added a new parameter into the model - user energy. In the previous models, we introduced the fitness factor into the project preference. On the other hand, user preference is based only on the number of projects he/she participated. The deficiency in the previous model could be the result of lacking user fitness, so in this model we introduced a fitness factor for user preference, which is called user energy. We define the energy as a constant parameter γ i for every user, where γ i is chosen for a uniform continuous distribution between (,]. This energy level can be described as the probability of a user will take a action instead of idling during every time step. This model yielded similar results for most of the network measures as previous models except the degree distributions. The biggest project in the project distribution decreased from more than 5 to below 5. However, the developer distribution still has significant deficiency at the tail, compared to the same measures in the. Further iteration on the modeling introduced dynamic user energy into the model. In this model, we made the energy decay linearly by time and meanwhile it can be self-adjusted by taking roles and responsibilities or giving up roles and responsibilities in the community. By using this dynamic energy, we expected to catch the details of the possible influence of a user to the related project, even the whole community, and then generate better matches in both the developer distribution and the project distribution. The results of the simulation of this model improve the matches of developer distribution and project distribution. The degree distributions of this model yield similar results as the previous models except that the busiest user is increased to projects, which is more close the the while the size of the largest project is kept below. This is more reasonable tail for the SourceForge.net and close to the set. Thus this model is able to replicate all the network measures we used in the verification by both visual inspection and statistics hypothesis test methods. After analyzing and iterating the models, our third model reproduced all the inspected network measures we chose of the and captured all the phenomenon we wanted to highlight in the community and can be used to simulate the community and possibly predict the future of the SourceForge.net community by extending the simulations. CONCLUSION In this study, we used simulation methods to assist in understanding the topology and evolution of the Source- Forge.net developer collaboration network. In the study, we used an interactive method to develop more models for the SourceForge.net community. We iterated three more models from our previous study to generate the virtual SourceForge.net. In the simulation iteration, we verified the result by comparing the measures from the simulated data and measures from the using both visual inspection and statistical hypothesis test. The final result of the study is a model to simulate the SourceForge.net with matched network measures. Using this model, we are able to simulate the SourceForge.net community and possibly SpringSim Vol. 9 ISBN -56555--6

predict the future development of the community by simply extendingthesimulationtime. Acknowledgment.ThisworkwassupportedinpartbytheNationalScienceFoundation,CISE/IIS-DigitalSociety&Technology,underGrantNo.89[,]. REFERENCES [] OSS research(http://www.nd.edu/~oss). 6. [] OSS research portal (http://zerlot.cse.nd.edu). 6. [] R. Albert and A.L. Barabási. Emergence of scaling in random networks. Science, 86:59 5, 999. [] R. Bellman. Adaptive control processes: A guided tour. Princeton University Press, Princeton, NJ, 96. [5] I. M. Chakravarti, R. G. Laha and J. Roy. Handbook of methods of applied statistics. John Wiley and Sons, I, 967. [6] P. Erdös and A. Rényi. On random graphs. Publicationes Mathematicae, 6:9 97, 959. [7] Y. Gao, G. Madey and V. Freeh. Modeling and simulation of the open source software community. ADS 5, San Diego, 5. [8] Y. Gao, V. Freeh and G. Madey. Analysis and modeling of the open source software community. NAACSOS, Pittsburgh,. [9] Y. Gao, V. Freeh and G. Madey. Conceptual framework for agent-based modeling and simulation. NAACSOS, Pittsburgh,. [] G. Madey, V. Freeh and R. Tynan. Agent-based modeling of open source using swarm. Eight Americas Conference of Information Systems,. [] G. Madey, V. Freeh and R. Tynan. The open source software development phenomenon: an analysis based on social network theory. Eight Americas Conference of Information Systems,. [] M.E.J. Newman and D.J. Watts. Random graph models of social networks. Physics Review, 6(68),. [] D.J. Watts and S. H. Strogatz. Collective dynamics of small-world networks. Nature, 9:, 998. ISBN -56555--6 5 SpringSim Vol.