An Investigation into the Free/Open Source Software Phenomenon using Data Mining, Social Network Theory, and Agent-Based
Greg Madey
Computer Science & Engineering, University of Notre Dame
UIUC - NSF Workshop on Continuous (Re)Design of Open Source Software
University of Illinois, Urbana-Champaign, October 8-9, 2003
This research was partially supported by the US National Science Foundation, CISE/IIS-Digital Society & Technology, under Grant No. 0222829
Contributors
Vincent Freeh, Computer Science, North Carolina State University (Principal Investigator)
Yongqin Gao, Computer Science and Engineering, University of Notre Dame (Graduate Student)
Jeff Goett, University of Notre Dame (REU Student)
Chris Hoffman, University of Notre Dame (REU Student)
Nadir Kiyanclar, University of Notre Dame (REU Student)
Greg Madey, Computer Science & Engineering, University of Notre Dame (Principal Investigator)
Patrick McGovern, Director SourceForge.net, VA Software (Industrial Collaborator)
Carlos Siu, University of Notre Dame (REU Student)
Renee Tynan, Department of Management, College of Business, University of Notre Dame (Principal Investigator)
Jin Xu, Computer Science & Engineering, University of Notre Dame (Graduate Student)
Outline Research approach Tools and definitions: Agents, models, simulations, collaborative social networks, computer experiments Data collection and analysis Example research question Simulation Computer experiments Results
One Approach to Researching F/OSSD
Online data: screen scraping, database dumps
Modeling: social network theory, evolutionary assumptions
Simulation: verification and validation
Computer experiments
A variation of the classical scientific method
Classical Scientific Method
1. Observe the world
   a) Identify a puzzling phenomenon
2. Generate a falsifiable hypothesis (K. Popper)
3. Design and conduct an experiment with the goal of disproving the hypothesis
   a) If the experiment fails, then the hypothesis is accepted (until replaced)
   b) If the experiment succeeds, then reject the hypothesis; additional insight into the phenomenon may still be obtained, and steps 2-3 are repeated
The Computer Experiment
Agent-Based Simulation as a Component of the Scientific Method: Modeling (Hypothesis), Observation, Agent-Based Simulation (Experiment)
Agent-Based Simulation as a Component of the Scientific Method
Modeling (Hypothesis): Social Network Model of F/OSS
Observation: Analysis of SourceForge Data
Agent-Based Simulation (Experiment): Grow Artificial SourceForge
Agent-Based Modeling and Simulation Conceptual models of a phenomenon Simulations are computer implementations of the conceptual models Agents in models and simulations are distinct entities (instantiated objects) Tend to be simple, but with large numbers of them (thousands, or more) - i.e., swarm intelligence Contrasted with higher level AI intelligent agents Foundations in complexity theory Self-organization Emergence
Collaborative Social Networks
Research-paper co-authorship, small world phenomenon, e.g., Erdos number (Barabasi 2001, Newman 2001)
Movie actors, small world phenomenon, e.g., Kevin Bacon number (Watts 1999, 2003)
Interlocking corporate directorships
Terrorist networks
Open-source software developers (Madey et al., AMCIS 2002)
Collaborators are the nodes of a graph, and collaborative relationships are its edges => a framework for modeling the data and the phenomenon
SourceForge: VA Software; part of OSDN; started 12/1999; collaboration tools; 70,000 projects; 90,000 developers; 700,000 registered users
Savannah SourceForge Software? Free Software Foundation 1,600 Projects 16,000 Registered Users
Observations
Web mining: web crawler (scripts in Python, Perl, AWK, Sed)
Collected monthly since Jan 2001
Fields: ProjectID, DeveloperID
Almost 2 million records
Stored in a relational database

PROJ   DEVELOPER
8001   dev378
8001   dev8975
8001   dev9972
8002   dev27650
8005   dev31351
8006   dev12509
8007   dev19395
8007   dev4622
8007   dev35611
8008   dev8975
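The (PROJ, DEVELOPER) records above are enough to derive who collaborates with whom. The following is a minimal sketch, not the project's actual schema or tooling: the table name `membership`, its column names, and the helper functions are assumptions made for illustration, and an in-memory SQLite database stands in for the relational database mentioned on the slide.

```python
import sqlite3

# Sample (project, developer) rows copied from the slide.
ROWS = [
    (8001, "dev378"), (8001, "dev8975"), (8001, "dev9972"),
    (8002, "dev27650"), (8005, "dev31351"), (8006, "dev12509"),
    (8007, "dev19395"), (8007, "dev4622"), (8007, "dev35611"),
    (8008, "dev8975"),
]


def load_membership(rows):
    """Load rows into an in-memory table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE membership (proj INTEGER, developer TEXT)")
    conn.executemany("INSERT INTO membership VALUES (?, ?)", rows)
    return conn


def codeveloper_pairs(conn):
    """Self-join: each unordered pair of developers sharing a project."""
    query = """
        SELECT a.developer, b.developer, a.proj
        FROM membership AS a
        JOIN membership AS b
          ON a.proj = b.proj AND a.developer < b.developer
        ORDER BY a.proj, a.developer, b.developer
    """
    return conn.execute(query).fetchall()


def projects_per_developer(conn):
    """Number of distinct projects each developer belongs to."""
    query = """
        SELECT developer, COUNT(DISTINCT proj)
        FROM membership
        GROUP BY developer
    """
    return dict(conn.execute(query).fetchall())


conn = load_membership(ROWS)
pairs = codeveloper_pairs(conn)
counts = projects_per_developer(conn)
```

Each row of `pairs` is one edge of the developer collaboration network, labeled with the project that creates it. A developer with a count above 1 in `counts` (here `dev8975`) connects otherwise separate projects.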
Collaboration Networks Adapted from Newman, Strogatz and Watts, 2001
F/OSS Developers - Collaboration Social Network
Developers are nodes / Projects are links
24 Developers, 5 Projects, 2 Linchpin Developers, 1 Cluster
[Figure: network diagram in which projects 6882, 7028, 7597, 9859, and 15850 link their developers, e.g., dev[46] and dev[47], into a single cluster]
Topological Analysis of the Data
Statistics inspected: diameter, average degree, clustering coefficient, degree distribution, cluster size distribution, relative size of major cluster, fitness and life cycle
Evolution of these statistics over time
Dual networks: developer network and project network
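The two dual networks can both be derived from the same project/developer membership records. The sketch below is one plausible way to do that projection, assuming memberships arrive as (project, developer) pairs; the function name `project_bipartite` and the dict-of-sets representation are illustrative choices, not taken from the original study.

```python
from collections import defaultdict
from itertools import combinations


def project_bipartite(memberships):
    """Project (project, developer) pairs onto the two dual networks.

    Returns (developer_net, project_net) as adjacency dicts of sets:
    developers are linked when they share a project, and projects are
    linked when they share a developer.
    """
    devs_of = defaultdict(set)   # project -> its developers
    projs_of = defaultdict(set)  # developer -> their projects
    for proj, dev in memberships:
        devs_of[proj].add(dev)
        projs_of[dev].add(proj)

    developer_net = {dev: set() for dev in projs_of}
    for members in devs_of.values():
        for a, b in combinations(sorted(members), 2):
            developer_net[a].add(b)
            developer_net[b].add(a)

    project_net = {proj: set() for proj in devs_of}
    for projs in projs_of.values():
        for p, q in combinations(sorted(projs), 2):
            project_net[p].add(q)
            project_net[q].add(p)

    return developer_net, project_net


# A few rows from the sample data shown earlier in the deck.
sample = [(8001, "dev378"), (8001, "dev8975"),
          (8008, "dev8975"), (8002, "dev27650")]
developer_net, project_net = project_bipartite(sample)
```

In this sample, `dev8975` belongs to both 8001 and 8008, so those two projects become neighbors in the project network, while project 8002 stays isolated.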
Terminology
Diameter: average length of the shortest paths between all pairs of vertices
Degree: the number of edges connected to a given vertex
Average degree: the average of the degrees of all vertices in the network
Cluster: a connected component of the network
Clustering coefficient (CC): CC_i is the fraction of links actually present among the vertices in the neighborhood of vertex i, relative to the total possible number of such links; CC is the average of all CC_i in the network
Degree distribution: the distribution of degrees throughout the network
Major cluster: the largest cluster in the network
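The definitions above translate fairly directly into code. This is a minimal sketch for small graphs, assuming an undirected network stored as a dict mapping each vertex to the set of its neighbors; all function names are illustrative. Following the slide, "diameter" is computed as the average shortest-path length, here taken over connected pairs only (an assumption, since the slide does not say how disconnected pairs are treated).

```python
from collections import deque


def average_degree(adj):
    """Mean number of edges per vertex."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def clusters(adj):
    """Connected components, largest first (the first is the major cluster)."""
    seen, found = set(), []
    for start in adj:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    component.add(w)
                    queue.append(w)
        found.append(component)
    return sorted(found, key=len, reverse=True)


def local_cc(adj, v):
    """CC_i: links present among v's neighbors / links possible."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(len(adj[a] & nbrs) for a in nbrs) / 2  # each link seen twice
    return links / (k * (k - 1) / 2)


def clustering_coefficient(adj):
    """CC: average of CC_i over all vertices."""
    return sum(local_cc(adj, v) for v in adj) / len(adj)


def average_path_length(adj):
    """Slide's 'diameter': mean shortest path over connected pairs (BFS)."""
    total = pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Toy graph: a triangle a-b-c with a pendant vertex d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
```

For the toy graph, the average degree is 2.0, there is a single cluster of four vertices, CC is (1 + 1 + 1/3 + 0) / 4, and the average shortest-path length is 8/6. Running one BFS per vertex is fine at this scale; networks with tens of thousands of nodes would likely need sampling or a graph library.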
Degree Distribution: Developers
Degree Distribution: Projects
Diameter of Developer Network vs. Time Network size increased from 30,000 to 70,000
Diameter of Project Network vs. Time Network size increased from 20,000 to 50,000. Diameter decreasing with time both for developer network and project network
Clustering Coefficient of Developer Network vs. Time
Clustering Coefficient of Project Network vs. Time
Cluster Size Distribution: R^2 with the major cluster is 0.7426; R^2 without the major cluster is 0.9799
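An R^2 value for a power-law distribution is commonly obtained by fitting a straight line to the data on log-log axes; the slides do not state the exact procedure used, so the sketch below is an assumption. The function name `power_law_fit` is illustrative and the numbers are synthetic, chosen only to show how a single outlying point (such as a very large major cluster) lowers R^2.

```python
import math


def power_law_fit(xs, ys):
    """Least-squares line through (log10 x, log10 y).

    Returns (slope, intercept, r_squared); for y = c * x**k the slope is k.
    """
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(lx, ly))
    ss_tot = sum((y - mean_y) ** 2 for y in ly)
    return slope, intercept, 1 - ss_res / ss_tot


# Synthetic cluster-size data (NOT from SourceForge): count = 100 * size**-2.
sizes = [1, 2, 4, 8]
counts = [100.0, 25.0, 6.25, 1.5625]
slope, intercept, r2_without = power_law_fit(sizes, counts)

# Add one very large cluster that occurs once, mimicking a major cluster.
_, _, r2_with = power_law_fit(sizes + [1000], counts + [1.0])
```

Here `r2_without` is essentially 1 because the synthetic points lie exactly on a power law, while `r2_with` is lower. The slide's two values follow the same pattern: R^2 is higher without the major cluster than with it.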
Relative Size of Major Cluster vs. Time Increase of the relative size of the major cluster Approaching steady-state?
An Example Research Question
What processes can explain the evolution of the project and developer social networks?
Randomly growing network (Erdos-Renyi, 1960)?
Evolving network with preferential attachment (Barabasi-Albert, 1999)?
Evolving network with preferential attachment and fitness (Barabasi-Albert, 2001)?
Others?
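The difference between the first two hypotheses can be illustrated with a small growth simulation. This is a simplified sketch, not the models used in the study: uniform attachment stands in for the random-growth hypothesis (the classical Erdos-Renyi model is not a growing network), and the function name and the parameters `n`, `m`, and `seed` are assumptions.

```python
import random


def grow_network(n, m, preferential, seed=0):
    """Grow a network to n nodes; each new node links to m existing nodes.

    preferential=True  -> targets chosen in proportion to current degree
    preferential=False -> targets chosen uniformly at random
    Returns a dict mapping node -> degree.
    """
    rng = random.Random(seed)
    # Seed graph: m + 1 fully connected nodes, each with degree m.
    degree = {v: m for v in range(m + 1)}
    # One entry per edge endpoint, so a uniform draw from this list
    # selects a node with probability proportional to its degree.
    stubs = [v for v in range(m + 1) for _ in range(m)]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            if preferential:
                targets.add(rng.choice(stubs))
            else:
                targets.add(rng.randrange(new))
        degree[new] = 0
        for t in targets:
            degree[t] += 1
            degree[new] += 1
            stubs.extend((t, new))
    return degree


ba_degrees = grow_network(2000, 2, preferential=True)
uniform_degrees = grow_network(2000, 2, preferential=False)
```

Both runs create the same number of edges, but preferential attachment concentrates them on a few early hubs, so `max(ba_degrees.values())` ends up far above `max(uniform_degrees.values())`. That heavy tail is the signature of a power-law degree distribution.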
Computer Experiments
Agent-based simulations
  Java programs using the Swarm class library
  Validation (docking) exercises using Java/Repast
Grow artificial SourceForges (Epstein & Axtell, 1996)
Parameterized with observed data, e.g., developer behaviors
  Join rates
  New project additions
  Leave projects
Evaluation of multiple models (hypotheses)
Verification/validation
Cycles of Modeling & Simulation
Modeling (Hypothesis): Social Network Models, ER => BA => BA+Fitness => BA+Dynamic Fitness
Observation: Analysis of SourceForge Data (Degree Distribution, Average Degree, Diameter, Clustering Coefficient, Cluster Size Distribution)
Agent-Based Simulation (Experiment): Grow Artificial SourceForge
Model for SourceForge
ABM based on a bipartite graph
Model description
  Agent: developer
  Behaviors: create, join, abandon, and idle
  Preference: developer's and project's
  Fitness
Four models in iterations: ER, BA, BA with constant fitness, and BA with dynamic fitness
Comparison of empirical and simulated data
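The model description above can be sketched as a toy agent-based simulation on a bipartite developer/project structure. This is not the authors' Swarm or Repast code: the probabilities `p_create`, `p_join`, and `p_abandon`, the fitness range, and the join weighting (project size plus one, times a constant fitness) are all assumptions chosen for illustration, and dynamic fitness is not modeled.

```python
import random


def simulate(steps, p_create=0.1, p_join=0.6, p_abandon=0.1, seed=1):
    """Toy bipartite ABM: one developer arrives per step, then one acts.

    Returns (members, joined):
      members: project id -> set of developer ids
      joined:  developer id -> set of project ids
    """
    rng = random.Random(seed)
    members = {0: {0}}                    # start with one project
    joined = {0: {0}}                     # and one developer on it
    fitness = {0: rng.uniform(0.1, 1.0)}  # constant per-project fitness
    for step in range(1, steps + 1):
        joined[step] = set()              # a new developer arrives
        dev = rng.randrange(len(joined))  # pick one developer to act
        roll = rng.random()
        if roll < p_create:
            # Create: start a new project with this developer on it.
            proj = len(members)
            members[proj] = {dev}
            fitness[proj] = rng.uniform(0.1, 1.0)
            joined[dev].add(proj)
        elif roll < p_create + p_join:
            # Join: preferential attachment weighted by size and fitness.
            ids = list(members)
            weights = [(len(members[p]) + 1) * fitness[p] for p in ids]
            proj = rng.choices(ids, weights=weights)[0]
            members[proj].add(dev)
            joined[dev].add(proj)
        elif roll < p_create + p_join + p_abandon and joined[dev]:
            # Abandon: leave one of the developer's current projects.
            proj = rng.choice(sorted(joined[dev]))
            members[proj].discard(dev)
            joined[dev].discard(proj)
        # Otherwise the developer idles this step.
    return members, joined


members, joined = simulate(500)
```

The two returned maps describe the same bipartite graph from both sides, so either one can be projected into a developer network or a project network and compared with the empirical statistics. Setting every fitness to the same value recovers a plain preferential-attachment variant.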
ER Model: Degree Distribution. The simulated degree distribution is normal, while the empirical distribution follows a power law. Fit fails!
ER Model: Average Degree and Diameter. Average degree is decreasing in the simulation, while it is increasing in the empirical data. Diameter is increasing in the simulation, while it is decreasing in the empirical data. Fit fails!
ER Model: Clustering Coefficient. The simulated clustering coefficient is relatively low (under 0.3), while it is around 0.7 in the empirical data. Fit fails!
ER Model: Cluster Size Distribution. Power law distribution with R^2 of 0.6667 (0.9653 without the major cluster), while R^2 in the empirical data is 0.7426 (0.9799 without the major cluster). The simulated distribution differs from the empirical data. Fit fails!
BA Model: Degree Distribution. Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data). For the developer distribution, simulated data has R^2 of 0.9798 and empirical data has R^2 of 0.9714. For the project distribution, simulated data has R^2 of 0.6650 and empirical data has R^2 of 0.9838. Partial fit!
BA Model Diameter and Clustering Coefficient Small diameter and high clustering coefficient like empirical data Diameter and clustering coefficient are both decreasing like empirical data Good Fit!
BA Model with Constant Fitness. Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data). For the developer distribution, simulated data has R^2 of 0.9742 and empirical data has R^2 of 0.9714. For the project distribution, simulated data has R^2 of 0.7253 and empirical data has R^2 of 0.9838. Improved fit!
Discovery: Project Life Cycle
BA Model with Dynamic Fitness. Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data). For the developer distribution, simulated data has R^2 of 0.9695 and empirical data has R^2 of 0.9714. For the project distribution, simulated data has R^2 of 0.8051 and empirical data has R^2 of 0.9838. Somewhat better fit!
Models of the F/OSS Social Network (Alternative Hypotheses)
General model features
  Agents are nodes on a graph (developers or projects)
  Behaviors: create, join, abandon, and idle
  Edges are relationships (joint project participation)
  Growth of network: random or types of preferential attachment, formation of clusters
  Fitness
  Network attributes: diameter, average degree, degree distribution, clustering coefficient
Four specific models
  ER (random graph) - (1960)
  BA (preferential attachment) - (1999)
  BA (+ constant fitness) - (2001)
  BA (+ dynamic fitness) - (2003)
Summary Why Agent-Based Modeling and Simulation? Can be used as components of the Scientific Method A research approach for studying socio-technical systems Case study: F/OSS - Collaboration Social Networks SourceForge conceptual models: ER, BA, BA with constant fitness and BA with dynamic fitness. Simulations Computer experiments that tested conceptual models Provided insight into the phenomenon under study and guided data mining of collected observations
Questions
Validity of approaches: social networks, simulation
Value/utility of approaches
Applicability to other areas of F/OSS research: project sites, e.g., Mozilla.org; individual projects, e.g., Linux kernel
Thank you