An Investigation into the Free/Open Source Software Phenomenon using Data Mining, Social Network Theory, and Agent-Based


An Investigation into the Free/Open Source Software Phenomenon using Data Mining, Social Network Theory, and Agent-Based
Greg Madey, Computer Science & Engineering, University of Notre Dame
UIUC - NSF Workshop on Continuous (Re)Design of Open Source Software, University of Illinois, Urbana-Champaign, October 8-9, 2003
This research was partially supported by the US National Science Foundation, CISE/IIS - Digital Society & Technology, under Grant No. 0222829

Contributors
- Vincent Freeh, Computer Science, North Carolina State University (Principal Investigator)
- Yongqin Gao, Computer Science and Engineering, University of Notre Dame (Graduate Student)
- Jeff Goett, University of Notre Dame (REU Student)
- Chris Hoffman, University of Notre Dame (REU Student)
- Nadir Kiyanclar, University of Notre Dame (REU Student)
- Greg Madey, Computer Science & Engineering, University of Notre Dame (Principal Investigator)
- Patrick McGovern, Director SourceForge.net, VA Software (Industrial Collaborator)
- Carlos Siu, University of Notre Dame (REU Student)
- Renee Tynan, Department of Management, College of Business, University of Notre Dame (Principal Investigator)
- Jin Xu, Computer Science & Engineering, University of Notre Dame (Graduate Student)

Outline
- Research approach
- Tools and definitions: agents, models, simulations, collaborative social networks, computer experiments
- Data collection and analysis
- Example research question
- Simulation
- Computer experiments
- Results

One Approach to Researching F/OSSD
- Online data
  - Screen scraping
  - Database dumps
- Modeling
  - Social network theory
  - Evolutionary assumptions
- Simulation
  - Verification and validation
- Computer experiments
- Variation of Classical Scientific Method

Classical Scientific Method
1. Observe the world
   a) Identify a puzzling phenomenon
2. Generate a falsifiable hypothesis (K. Popper)
3. Design and conduct an experiment with the goal of disproving the hypothesis
   a) If the experiment fails, then the hypothesis is accepted (until replaced)
   b) If the experiment succeeds, then reject the hypothesis, but additional insight into the phenomenon may be obtained and steps 2-3 repeated

The Computer Experiment

Agent-Based Simulation as a Component of the Scientific Method
- Modeling (Hypothesis)
- Observation
- Agent-Based Simulation (Experiment)

Agent-Based Simulation as a Component of the Scientific Method
- Modeling (Hypothesis): Social Network Model of F/OSS
- Observation: Analysis of SourceForge Data
- Agent-Based Simulation (Experiment): Grow Artificial SourceForge

Agent-Based Modeling and Simulation
- Conceptual models of a phenomenon
- Simulations are computer implementations of the conceptual models
- Agents in models and simulations are distinct entities (instantiated objects)
  - Tend to be simple, but with large numbers of them (thousands, or more), i.e., swarm intelligence
  - Contrasted with higher-level AI "intelligent agents"
- Foundations in complexity theory
  - Self-organization
  - Emergence

Collaborative Social Networks
- Research-paper co-authorship, small world phenomenon, e.g., Erdos number (Barabasi 2001, Newman 2001)
- Movie actors, small world phenomenon, e.g., Kevin Bacon number (Watts 1999, 2003)
- Interlocking corporate directorships
- Terrorist networks
- Open-source software developers (Madey et al., AMCIS 2002)
- Collaborators are nodes in a graph, and collaborative relationships are the edges of the graph => a framework to model data/phenomenon

SourceForge
- VA Software
- Part of OSDN
- Started 12/1999
- Collaboration tools
- 70,000 Projects
- 90,000 Developers
- 700,000 Registered Users

Savannah
- SourceForge software
- Free Software Foundation
- 1,600 Projects
- 16,000 Registered Users

Observations
- Web mining
- Web crawler (scripts): Python, Perl, AWK, Sed
- Monthly since Jan 2001
- ProjectID, DeveloperID
- Almost 2 million records
- Relational database

PROJ   DEVELOPER
8001   dev378
8001   dev8975
8001   dev9972
8002   dev27650
8005   dev31351
8006   dev12509
8007   dev19395
8007   dev4622
8007   dev35611
8008   dev8975

Collaboration Networks Adapted from Newman, Strogatz and Watts, 2001

F/OSS Developers - Collaboration Social Network
- Developers are nodes / Projects are links
- 24 Developers
- 5 Projects
- 2 Linchpin Developers
- 1 Cluster
[Figure: network diagram of developers dev[45] through dev[99] grouped around Projects 6882, 7028, 7597, 9859, and 15850]
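The "developers are nodes, projects are links" construction can be sketched in a few lines of Python. This is only an illustration, not the code used in the study: the records are the sample PROJ/DEVELOPER rows from the Observations slide, the function name `build_developer_network` is hypothetical, and "linchpin" is read here as "a developer who belongs to more than one project", which is an assumption about the slide's usage.

```python
from collections import defaultdict
from itertools import combinations

# Sample (project, developer) rows copied from the Observations slide.
records = [
    (8001, "dev378"), (8001, "dev8975"), (8001, "dev9972"),
    (8002, "dev27650"), (8005, "dev31351"), (8006, "dev12509"),
    (8007, "dev19395"), (8007, "dev4622"), (8007, "dev35611"),
    (8008, "dev8975"),
]

def build_developer_network(records):
    """Return {developer: set of co-developers} (hypothetical helper)."""
    members = defaultdict(set)              # project -> developers on it
    for project, developer in records:
        members[project].add(developer)
    # Every developer is a node, even one who works alone.
    network = {developer: set() for _, developer in records}
    for developers in members.values():
        # Two developers are linked when they share a project.
        for a, b in combinations(sorted(developers), 2):
            network[a].add(b)
            network[b].add(a)
    return network

network = build_developer_network(records)

# Assumed reading of "linchpin": a developer on more than one project.
projects_of = defaultdict(set)
for project, developer in records:
    projects_of[developer].add(project)
linchpins = sorted(d for d, ps in projects_of.items() if len(ps) > 1)

print(sorted(network["dev8975"]))
print(linchpins)
```

The same records yield the dual project network by swapping the roles of the two columns, so that projects become nodes and shared developers become links.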

Topological Analysis of the Data
- Statistics inspected
  - Diameter
  - Average degree
  - Clustering coefficient
  - Degree distribution
  - Cluster size distribution
  - Relative size of major cluster
  - Fitness and life cycle
- Evolution of these statistics
- Dual networks: developer network and project network

Terminology
- Diameter: average length of shortest paths between all pairs of vertices
- Degree: the count of edges connected to a given vertex
- Average degree: average of the degrees of all vertices in the network
- Cluster: the connected components of the network
- Clustering coefficient (CC)
  - CC_i: fraction representing the number of links actually present relative to the total possible number of links among the vertices in the neighborhood of vertex i
  - CC: average of all CC_i in a network
- Degree distribution: the distribution of degrees throughout a network
- Major cluster: the largest cluster in the network
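As a rough illustration of these definitions, the sketch below computes them on a small made-up graph stored as an adjacency dictionary. All function names and the `sample` graph are hypothetical. Two choices are assumptions not stated on the slide: vertices with fewer than two neighbors get CC_i = 0, and the "diameter" (average shortest-path length, as defined above) is averaged only over pairs that are actually connected.

```python
from collections import deque

def average_degree(adj):
    """Mean number of edges per vertex."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

def bfs_distances(adj, source):
    """Hop counts from source to every vertex reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def clusters(adj):
    """Connected components, largest (the major cluster) first."""
    seen, found = set(), []
    for v in adj:
        if v not in seen:
            component = set(bfs_distances(adj, v))
            seen |= component
            found.append(component)
    return sorted(found, key=len, reverse=True)

def local_cc(adj, v):
    """CC_i: links present among v's neighbors / links possible."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumption: undefined cases count as zero
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2 * links / (k * (k - 1))

def clustering_coefficient(adj):
    """CC: average of CC_i over all vertices."""
    return sum(local_cc(adj, v) for v in adj) / len(adj)

def average_shortest_path(adj):
    """The slide's 'diameter', averaged over connected pairs only."""
    total = pairs = 0
    for v in adj:
        dist = bfs_distances(adj, v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

# Made-up graph: a triangle a-b-c, a pendant vertex d, an isolated vertex e.
sample = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
          "d": {"c"}, "e": set()}
print(average_degree(sample), clustering_coefficient(sample),
      average_shortest_path(sample), len(clusters(sample)[0]))
```

The quadratic neighbor scan in `local_cc` is fine for a toy graph; a network of tens of thousands of vertices would normally be handled with a graph library instead.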

Degree Distribution: Developers

Degree Distribution: Projects

Diameter of Developer Network vs. Time
- Network size increased from 30,000 to 70,000

Diameter of Project Network vs. Time
- Network size increased from 20,000 to 50,000
- Diameter decreasing with time for both the developer network and the project network

Clustering Coefficient of Developer Network vs. Time

Clustering Coefficient of Project Network vs. Time

Cluster Size Distribution
- R^2 with major cluster is 0.7426
- R^2 without major cluster is 0.9799
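The slides do not say how the R^2 values were computed. One common approach, assumed here, is an ordinary least-squares line fitted to the distribution on log-log axes. The sketch below uses synthetic numbers, not SourceForge data, and the name `fit_power_law` is hypothetical. It shows why a single very large cluster can pull R^2 down when it is included in the fit.

```python
import math

def fit_power_law(sizes, counts):
    """Least-squares line through (log10 size, log10 count).

    Returns (slope, intercept, r_squared). Assumes a power law
    count = C * size**slope, which is a straight line on log-log axes.
    """
    xs = [math.log10(s) for s in sizes]
    ys = [math.log10(c) for c in counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

# Synthetic data: an exact power law with exponent -2 ...
sizes = [1, 2, 4, 8, 16]
counts = [1000 * s ** -2 for s in sizes]
slope, intercept, r2_without = fit_power_law(sizes, counts)

# ... plus one made-up "major cluster" far off the line.
_, _, r2_with = fit_power_law(sizes + [10000], counts + [1])

print(round(slope, 3), round(r2_without, 4), round(r2_with, 4))
```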

Relative Size of Major Cluster vs. Time
- Increase of the relative size of the major cluster
- Approaching steady-state?

An Example Research Question
What processes can explain the evolution of the project and developer social networks?
- Randomly growing network (Erdos-Renyi, 1960)?
- Evolving network with preferential attachment (Barabasi-Albert, 1999)?
- Evolving network with preferential attachment and fitness (Barabasi-Albert, 2001)?
- Others?
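The difference between the first two candidate processes can be seen in a toy growth simulation. This is a minimal sketch under simplifying assumptions: each new node adds exactly one link, and uniform random attachment stands in for the "randomly growing" hypothesis (it is not the classical fixed-size Erdos-Renyi construction). The function name `grow_network` and all parameter values are hypothetical.

```python
import random

def grow_network(n_nodes, preferential, seed=0):
    """Grow a network one node at a time; return the list of degrees.

    preferential=False: the new node links to a uniformly random node.
    preferential=True:  the new node links to a node chosen with
                        probability proportional to its current degree.
    """
    rng = random.Random(seed)
    degrees = [1, 1]     # start from two nodes joined by one edge
    endpoints = [0, 1]   # each node appears here once per incident edge
    for new in range(2, n_nodes):
        if preferential:
            # Picking a random edge endpoint is degree-proportional sampling.
            target = rng.choice(endpoints)
        else:
            target = rng.randrange(new)
        degrees.append(1)
        degrees[target] += 1
        endpoints.extend([new, target])
    return degrees

uniform = grow_network(5000, preferential=False)
preferred = grow_network(5000, preferential=True)
print("largest degree, uniform attachment:     ", max(uniform))
print("largest degree, preferential attachment:", max(preferred))
```

Preferential attachment tends to produce a few very highly connected hubs, which is the heavy-tailed pattern a power-law degree distribution describes; uniform attachment keeps the largest degree comparatively small.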

Computer Experiments
- Agent-based simulations
  - Java programs using Swarm class library
  - Validation (docking) exercises using Java/Repast
- Grow artificial SourceForges (Epstein & Axtell, 1996)
- Parameterized with observed data, e.g., developer behaviors
  - Join rates
  - New project additions
  - Leave projects
- Evaluation of multiple models (hypotheses)
- Verification/validation

Cycles of Modeling & Simulation
- Modeling (Hypothesis): Social Network Models
  - ER => BA => BA+Fitness => BA+Dynamic Fitness
- Observation: Analysis of SourceForge Data
  - Degree Distribution
  - Average Degree
  - Diameter
  - Clustering Coefficient
  - Cluster Size Distribution
- Agent-Based Simulation (Experiment): Grow Artificial SourceForge

Model for SourceForge
- ABM based on bipartite graph
- Model description
  - Agent: developer
  - Behaviors: create, join, abandon and idle
  - Preference: developer's and project's
  - Fitness
- Four models in iterations: ER, BA, BA with constant fitness, and BA with dynamic fitness
- Comparison of empirical and simulated data
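To make the four behaviors concrete, here is a heavily simplified sketch of an agent-based loop on a bipartite developer/project structure. It is not the Swarm or Repast model from the study. Every number in it is a made-up placeholder, the name `simulate` is hypothetical, and several rules are assumptions: one new developer arrives per step, one randomly chosen developer acts per step, and joining favors larger projects (a BA-style preference).

```python
import random

def simulate(steps, p_create=0.1, p_join=0.6, p_abandon=0.1, seed=1):
    """Toy bipartite ABM; the probabilities are illustrative placeholders.

    Returns (projects, memberships):
      projects:    project id -> set of developer ids
      memberships: developer id -> set of project ids
    """
    rng = random.Random(seed)
    projects = {0: {0}}
    memberships = {0: {0}}
    for dev in range(1, steps + 1):
        memberships[dev] = set()            # a new developer arrives
        actor = rng.randrange(dev + 1)      # one developer acts this step
        r = rng.random()
        if r < p_create:                    # create a new project
            pid = len(projects)
            projects[pid] = {actor}
            memberships[actor].add(pid)
        elif r < p_create + p_join:         # join, favoring larger projects
            pids = list(projects)
            weights = [len(projects[p]) + 1 for p in pids]
            pid = rng.choices(pids, weights=weights)[0]
            projects[pid].add(actor)
            memberships[actor].add(pid)
        elif r < p_create + p_join + p_abandon:   # abandon one project
            if memberships[actor]:
                pid = rng.choice(sorted(memberships[actor]))
                projects[pid].discard(actor)
                memberships[actor].discard(pid)
        # otherwise: idle
    return projects, memberships

projects, memberships = simulate(500)
print(len(projects), "projects;", len(memberships), "developers")
```

In the approach described on the slides, rates such as joining and leaving would be parameterized from the observed SourceForge data, and the simulated network statistics would then be compared with the empirical ones.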

ER Model: Degree Distribution
- Degree distribution is a normal distribution, while it is a power law in the empirical data
- Fit fails!

ER Model: Diameter
- Average degree is decreasing, while it is increasing in the empirical data
- Diameter is increasing, while it is decreasing in the empirical data
- Fit fails!

ER Model: Clustering Coefficient
- Clustering coefficient is relatively low, under 0.3, while it is around 0.7 in the empirical data
- Fit fails!

ER Model: Cluster Size Distribution
- Power law distribution with R^2 of 0.6667 (0.9653 without the major cluster), while R^2 in the empirical data is 0.7426 (0.9799 without the major cluster)
- The actual distribution is different from the empirical data
- Fit fails!

BA Model: Degree Distribution
- Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data)
- Developer distribution: simulated data has R^2 of 0.9798 and empirical data has R^2 of 0.9714
- Project distribution: simulated data has R^2 of 0.6650 and empirical data has R^2 of 0.9838
- Partial fit!

BA Model: Diameter and Clustering Coefficient
- Small diameter and high clustering coefficient, like the empirical data
- Diameter and clustering coefficient are both decreasing, like the empirical data
- Good fit!

BA Model with Constant Fitness
- Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data)
- Developer distribution: simulated data has R^2 of 0.9742 and empirical data has R^2 of 0.9714
- Project distribution: simulated data has R^2 of 0.7253 and empirical data has R^2 of 0.9838
- Improved fit!

Discovery: Project Life Cycle

BA Model with Dynamic Fitness
- Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data)
- Developer distribution: simulated data has R^2 of 0.9695 and empirical data has R^2 of 0.9714
- Project distribution: simulated data has R^2 of 0.8051 and empirical data has R^2 of 0.9838
- Somewhat better fit!

Models of the F/OSS Social Network (Alternative Hypotheses)
- General model features
  - Agents are nodes on a graph (developers or projects)
  - Behaviors: create, join, abandon and idle
  - Edges are relationships (joint project participation)
  - Growth of network: random or types of preferential attachment, formation of clusters
  - Fitness
  - Network attributes: diameter, average degree, degree distribution, clustering coefficient
- Four specific models
  - ER (random graph) - (1960)
  - BA (preferential attachment) - (1999)
  - BA (+ constant fitness) - (2001)
  - BA (+ dynamic fitness) - (2003)
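One way to compare the four models is by the probability that a new link attaches to an existing node i. The forms below are the standard textbook expressions for uniform attachment, preferential attachment, and preferential attachment with fitness. The slides do not give the formulas used in the study, so the notation is an assumption, and the dynamic-fitness line only indicates that fitness varies with time; its actual functional form is not stated in the talk.

```latex
% N: number of existing nodes, k_i: degree of node i,
% \eta_i: fitness of node i (assumed notation, not from the slides).
\begin{align}
\Pi_i^{\text{random}} &= \frac{1}{N} \\
\Pi_i^{\text{BA}} &= \frac{k_i}{\sum_j k_j} \\
\Pi_i^{\text{BA + constant fitness}} &= \frac{\eta_i \, k_i}{\sum_j \eta_j \, k_j} \\
\Pi_i^{\text{BA + dynamic fitness}}(t) &= \frac{\eta_i(t) \, k_i}{\sum_j \eta_j(t) \, k_j}
\end{align}
```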

Summary
- Why Agent-Based Modeling and Simulation?
  - Can be used as components of the Scientific Method
  - A research approach for studying socio-technical systems
- Case study: F/OSS - Collaboration Social Networks
  - SourceForge conceptual models: ER, BA, BA with constant fitness, and BA with dynamic fitness
- Simulations
  - Computer experiments that tested conceptual models
  - Provided insight into the phenomenon under study and guided data mining of collected observations

Questions
- Validity of approaches
  - Social networks
  - Simulation
- Value/utility of approaches
- Applicability to other areas of F/OSS research
  - Project sites, e.g., Mozilla.org
  - Individual projects, e.g., Linux kernel

Thank you