Transcript Slide 1

An Investigation into the Free/Open
Source Software Phenomenon using
Data Mining, Social Network Theory,
and Agent-Based
Greg Madey
Computer Science & Engineering
University of Notre Dame
UIUC - NSF Workshop on Continuous (Re)Design of
Open Source Software
University of Illinois, Urbana-Champaign
October 8-9, 2003
This research was partially supported by the US National Science Foundation, CISE/IISDigital Society & Technology, under Grant No. 0222829
Contributors
• Vincent Freeh, Computer Science, North Carolina State University (Principal
Investigator)
• Yongqin Gao, Computer Science and Engineering, University of Notre Dame
(Graduate Student)
• Jeff Goett, University of Notre Dame (REU Student)
• Chris Hoffman, University of Notre Dame (REU Student)
• Nadir Kiyanclar, University of Notre Dame (REU Student)
• Greg Madey, Computer Science & Engineering, University of Notre Dame
(Principal Investigator)
• Patrick McGovern, Director SourceForge.net, VA Software (Industrial
Collaborator)
• Carlos Siu, University of Notre Dame (REU Student)
• Renee Tynan, Department of Management, College of Business, University of
Notre Dame (Principal Investigator)
• Jin Xu, Computer Science & Engineering, University of Notre Dame (Graduate
Student)
Outline
• Research approach
• Tools and definitions: Agents, models, simulations,
collaborative social networks, computer experiments
• Data collection and analysis
• Example research question
• Simulation
• Computer experiments
• Results
One Approach to Researching
F/OSSD
• Online data
– Screen scraping
– Database dumps
• Modeling
– Social network theory
– Evolutionary assumptions
• Simulation
– Verification and validation
– Computer experiments
• Variation of Classical Scientific Method
Classical Scientific Method
1. Observe the world
a) Identify a puzzling phenomenon
2. Generate a falsifiable hypothesis (K. Popper)
3. Design and conduct an experiment with the
goal of disproving the hypothesis
a) If the experiment “fails”, then the hypothesis is
accepted (until replaced)
b) If the experiment “succeeds”, then reject
hypothesis, but additional insight into the
phenomenon may be obtained and steps 2-3
repeated
The Computer Experiment
Agent-Based Simulation as
a Component of the
Scientific Method
Modeling
(Hypothesis)
Observation
Agent -Based
Simulation
(Experiment)
Agent-Based Simulation as
a Component of the
Scientific Method
Modeling
(Hypothesis)
Social Network
Model of F/OSS
Observation
Analysis of
SourceForge
Data
Agent -Based
Simulation
(Experiment)
Grow Artificial
SourceForge
Agent-Based Modeling and
Simulation
• Conceptual models of a phenomenon
• Simulations are computer implementations of the
conceptual models
• Agents in models and simulations are distinct entities
(instantiated objects)
– Tend to be simple, but with large numbers of them (thousands, or
more) - i.e., swarm intelligence
– Contrasted with higher level AI “intelligent agents”
• Foundations in complexity theory
– Self-organization
– Emergence
Collaborative Social Networks
• Research-paper co-authorship, small world phenomenon, e.g., Erdos
number (Barabasi 2001, Newman 2001)
• Movie actors, small world phenomenon, e.g., Kevin Bacon number
(Watts 1999, 2003)
•
•
•
•
Interlocking corporate directorships
Terrorist Networks
Open-source software developers (Madey et al, AMCIS 2002)
Collaborators are nodes in a graph, and collaborative relationship are
the edges of the graph => a framework to model data/phenomenon
SourceForge
• VA Software
• Part of OSDN
• Started 12/1999
• Collaboration tools
• 70,000 Projects
• 90,000 Developers
• 700,00 Registered
Users
Savannah
• SourceForge
Software?
• Free Software
Foundation
•1,600 Projects
•16,000 Registered
Users
Observations
• Web mining
• Web crawler (scripts)
–
–
–
–
•
•
•
•
•
•
Python
Perl
AWK
Sed
Monthly
Since Jan 2001
ProjectID
DeveloperID
Almost 2 million records
Relational database
PROJ|DEVELOPER
8001|dev378
8001|dev8975
8001|dev9972
8002|dev27650
8005|dev31351
8006|dev12509
8007|dev19395
8007|dev4622
8007|dev35611
8008|dev8975
Collaboration Networks
Adapted from Newman, Strogatz and Watts, 2001
F/OSS Developers - Collaboration Social Network
Developers are nodes / Projects are links
24 Developers
5 Projects
2 Linchpin Developers
1 Cluster
Project 7597
dev[64]
dev[72]
dev[67]
Project 6882
Project 7028
dev[52]
dev[65]
dev[70]
dev[57]
6882 dev[47]
dev[52]
6882 dev[47]
dev[55]
dev[47]
dev[99]
dev[55]
dev[51]
6882 dev[47]6882 dev[58]
dev[79]
dev[47]
7597 dev[46]
7597 dev[46]dev[64]
7597 dev[46]
dev[72]
dev[67]
7597 dev[46]
dev[55]
7597 dev[46]
7028 dev[46] dev[70]
7597 dev[46]
7028 dev[46]
dev[57]
dev[45]
dev[99]
7597 dev[46]
7028 dev[46]
dev[61]
dev[51]
7597 dev[46]
dev[58]
dev[46]
15850 dev[46]
dev[58]
dev[79]
dev[58]
9859 dev[46]
dev[54]
dev[45]
dev[61]
dev[58]
dev[54]
9859 dev[46]
9859 dev[46]
dev[49]
dev[53] 9859 dev[46]
15850 dev[46]
dev[59]
dev[56]
15850 dev[46]
dev[83] 15850 dev[46]
dev[48]
dev[49]
dev[53]
dev[56]
dev[83]
dev[59]
Project 9859
Project 15850
dev[48]
Topological Analysis of the Data
• Statistics inspected
–
–
–
–
–
–
–
Diameter
Average degree
Clustering coefficient
Degree distribution
Cluster size distribution
Relative size of major cluster
Fitness and life cycle
• Evolution of these statistics
• Dual networks
– developer network and project network
Terminology
•
•
•
•
•
•
•
Diameter
– Average length of shortest paths between all pairs of vertices
Degree
– The count of edges connected to given vertex
Average degree
– Average of the degrees of all vertices in the network
Cluster
– The connected components of the network
Clustering coefficient (CC)
– CCi: Fraction representing the number of links actually present relative to the total
possible number of links among the vertices in its neighborhood.
– CC: average of all CCi in a network
Degree distribution
– The distribution of degrees throughout a network
Major cluster
– The largest cluster in the network
Degree Distribution: Developers
Degree Distribution: Projects
Diameter of Developer
Network vs. Time
• Network size
increased
from 30,000
to 70,000
Diameter of Project
Network vs. Time
• Network size
increased from
20,000 to 50,000.
• Diameter
decreasing with
time both for
developer network
and project
network
Clustering Coefficient of Developer
Network vs. Time
Clustering Coefficient of Project
Network vs. Time
Cluster Size Distribution
• R2 with major
cluster is
0.7426
• R2 without
major cluster is
0.9799
Relative Size of Major Cluster vs.
Time
• Increase of the
relative size of
the major
cluster
• Approaching
steady-state?
An Example Research Question
• What processes can explain the evolution of the
project and developer social networks?
– Randomly growing network (Erdos-Reyni, 1960)?
– Evolving network with preferential attachment
(Barabasi-Albert, 1999)?
– Evolving network with preferential attachment and
fitness (Barabasi-Albert, 2001)?
– Others?
Computer Experiments
• Agent-based simulations
• Java programs using Swarm class library
– Validation (docking) exercises using Java/Repast
• Grow artificial SourceForge’s (Epstein & Axtell, 1996)
– Parameterized with observed data, e.g., developer behaviors
• Join rates
• New project additions
• Leave projects
– Evaluation of multiple models (hypotheses)
– Verification/validation
Cycles of Modeling & Simulation
Modeling
(Hypothesis)
Social Network Models
ER => BA => BA+Fitness => BA+Dynamic Fitness
Agent -Based
Simulation
Observation
Analysis of
SourceForge
Data
Degree Distribution
Average Degree
Diameter
Clustering Coefficient
Cluster Size Distribution
(Experiment)
Grow Artificial
SourceForge
Model for SourceForge
• ABM based on bipartite graph
• Model description
–
–
–
–
Agent: developer
Behaviors: Create, join, abandon and idle
Preference: developer’s and project’s
Fitness
• Four models in iterations
– ER, BA, BA with constant fitness and BA with dynamic
fitness
• Comparison of empirical and simulated data
ER Model – Degree Distribution
• Degree
distribution is
normal
distribution
while it is
power law in
empirical data
• Fit Fails!
ER Model - Diameter
• Average degree is
decreasing while it
is increasing in
empirical data
• Diameter is
increasing while it
is decreasing in
empirical data
• Fit Fails!
ER Model – Clustering
Coefficient • Clustering
coefficient is
relatively low under
0.3 while it is
around 0.7 in
empirical data.
• Fit fails!
ER Model – Cluster Size
Distribution
• Power law distribution
with R2 as 0.6667
(0.9653 without the
major cluster) while
R2 in empirical data is
0.7426 (0.9799
without the major
cluster)
• The actual distribution
is different from
empirical data
• Fit Fails!
BA Model – Degree Distribution
• Power laws in degree
distributions, similar to
empirical data (o for
simulated data and x for
empirical data).
• For developer
distribution: simulated
data has R2 as 0.9798
and empirical data has R2
as 0.9714.
• For project distribution:
simulated data has R2 as
0.6650 and empirical
data has R2 as 0.9838.
• Partial Fit!
BA Model – Diameter and
Clustering Coefficient
• Small diameter and
high clustering
coefficient like
empirical data
• Diameter and
clustering
coefficient are
both decreasing
like empirical data
• Good Fit!
BA Model with Constant Fitness
• Power laws in degree
distributions, similar to
empirical data (o for simulated
data and x for empirical data).
• For developer distribution:
simulated data has R2 as
0.9742 and empirical data has
R2 as 0.9714.
• For project distribution:
simulated data has R2 as
0.7253 and empirical data has
R2 as 0.9838.
• Improved fit!
Discovery: Project Life Cycle
BA Model with Dynamic Fitness
• Power laws in degree
distribution, similar to
empirical data (o for
simulated data and x for
empirical data).
• For developer distribution:
simulated data has R2 as
0.9695 and empirical data has
R2 as 0.9714.
• For project distribution:
simulated data has R2 as
0.8051 and empirical data has
R2 as 0.9838.
• Somewhat better fit!
Models of the F/OSS Social Network
(Alternative Hypotheses)
• General model features
– Agents are nodes on a graph (developers or projects)
– Behaviors: Create, join, abandon and idle
– Edges are relationships (joint project participation)
– Growth of network: random or types of preferential
attachment, formation of clusters
– Fitness
– Network attributes: diameter, average degree, degree
distribution, clustering coefficient
• Four specific models
–
–
–
–
ER (random graph) - (1960)
BA (preferential attachment) - (1999)
BA ( + constant fitness) - (2001)
BA ( + dynamic fitness) - (2003)
Summary
Summary
• Why Agent-Based Modeling and Simulation?
– Can be used as components of the Scientific Method
– A research approach for studying socio-technical
systems
• Case study: F/OSS - Collaboration Social Networks
– SourceForge conceptual models: ER, BA, BA with
constant fitness and BA with dynamic fitness.
– Simulations
• Computer experiments that tested conceptual models
• Provided insight into the phenomenon under study and guided
data mining of collected observations
Questions
• Validity of approaches
– Social networks
– Simulation
• Value/Utility of approachs
• Applicability to other areas of F/OSS research
– Project sites, e.g., Mozilla.org
– Individual projects, e.g., Linux kernel
Thank you