Agent-Based Modeling and Simulation
of Collaborative Social Networks
Research in Progress
Greg Madey
Yongqin Gao
Computer Science &
Engineering
University of Notre Dame
Vincent Freeh
Computer Science
North Carolina State
University
Renee Tynan
Chris Hoffman
Department of
Management
University of Notre Dame
AMCIS2003
Tampa, FL
August 2003
Supported in part by the
National Science Foundation - Digital Society & Technology Program
Outline
• Definitions: Agents, models, simulations, collaborative
social networks, computer experiments
• Phenomenon: Free/Open Source Software (F/OSS)
• Conceptual models
– ER model
– BA model
– BA model with constant fitness
– BA model with dynamic fitness
• Experiments and results
• Summary
• Some discussion questions
Agent-Based Modeling and
Simulation
• Conceptual models of a phenomenon
• Simulations are computer implementations of the
conceptual models
• Agents in models and simulations are distinct
entities (instantiated objects)
– Tend to be simple, but with large numbers of them
(thousands, or more) - i.e., swarm intelligence
– Contrasted with higher level “intelligent agents”
• Foundations in complexity theory
– Self-organization
– Emergence
Collaborative Social Networks
• Research-paper co-authorship, small world phenomenon, e.g., Erdős
number (Barabási 2001, Newman 2001)
• Movie actors, small world phenomenon, e.g., Kevin Bacon number
(Watts 1999, 2003)
• Interlocking corporate directorships
• Open-source software developers (Madey et al, AMCIS 2002)
• Collaborators are nodes in a graph, and collaborative relationships are
the edges of the graph
Classical Scientific Method
1. Observe the world
a) Identify a puzzling phenomenon
2. Generate a falsifiable hypothesis (K. Popper)
3. Design and conduct an experiment with the
goal of disproving the hypothesis
a) If the experiment “fails”, then the hypothesis is
accepted (until replaced)
b) If the experiment “succeeds”, then reject
hypothesis, but additional insight into the
phenomenon may be obtained and steps 2-3
repeated
The Computer Experiment
Agent-Based Simulation as
a Component of the
Scientific Method
Modeling
(Hypothesis)
Observation
Agent-Based
Simulation
(Experiment)
Agent-Based Simulation as
a Component of the
Scientific Method
Modeling
(Hypothesis)
Social Network
Model of F/OSS
Observation
Analysis of
SourceForge
Data
Agent-Based
Simulation
(Experiment)
Grow Artificial
SourceForge
Open Source Software (OSS)
• Free …
– to view source
– to modify
– to share
– of cost
• Examples
– Apache
– Perl
– GNU
– Linux
– Sendmail
– Python
– KDE
– GNOME
– Mozilla
– Thousands more
Free Open Source Software (F/OSS)
• Development
– Mostly volunteer
– Global teams
– Virtual teams
– Self-organized - often peer-based meritocracy
– Self-managed - but often a “charismatic” leader
– Often large numbers of developers, testers, support help, end
user participation
– Rapid, frequent releases
– Mostly unpaid
F/OSS
Developers
Larry Wall: Perl
Linus Torvalds: Linux
Eric Raymond: The Cathedral and the Bazaar
Richard Stallman: GNU, GNU Manifesto
F/OSS: A Puzzling Phenomenon
• Contradicts traditional
wisdom:
– Software engineering
– Coordination, large numbers
– Motivation of developers
– Quality
– Security
– Business strategy
• Almost everything is done
electronically and available in
digital form
• Opportunity for IS Research
-- large amounts of online
data available
• Research issues:
– Understanding motives
– Understanding processes
– Intellectual property
– Digital divide
– Self-organization
– Government policy
– Impact on innovation
– Ethics
– Economic models
– Cultural issues
– International factors
SourceForge
• VA Software
• Part of OSDN
• Started 12/1999
• Collaboration tools
• 58,685 Projects
• 80,000 Developers
• 590,00 Registered
Users
Savannah
• Uses SourceForge
Software
• Free Software
Foundation
• 1,508 Projects
• 15,265 Registered
Users
F/OSS: Importance
Major Component of e-Technology Infrastructure with major
presence in
e-Commerce
e-Science
e-Government
e-Learning
Apache has over 65% market share of Internet Web servers
Linux on over 7 million computers
Most Internet e-mail runs on Sendmail
Tens of thousands of quality products
Part of product offerings of companies like IBM, Apple
Apache in WebSphere, Linux on mainframe, FreeBSD in OSX
Corporate employees participating on OSS projects
Free/Open Source Software
• Seems to challenge traditional economic assumptions
• Model for software engineering
• New business strategies
– Cooperation with competitors
– Beyond trade associations, shared industry research, and
standards processes — shared product development!
• Virtual, self-organizing and self-managing teams
• Social issues, e.g., digital divide, international
participation
• Government policy issues, e.g., US software industry,
impact on innovation, security, intellectual property
Observations
• Web mining
• Web crawler (scripts)
– Python
– Perl
– AWK
– Sed
• Monthly
• Since Jan 2001
• ProjectID
• DeveloperID
• Almost 2 million records
• Relational database
PROJ|DEVELOPER
8001|dev378
8001|dev8975
8001|dev9972
8002|dev27650
8005|dev31351
8006|dev12509
8007|dev19395
8007|dev4622
8007|dev35611
8008|dev8975
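The PROJ|DEVELOPER records above are enough to build the collaboration network. The following is a minimal sketch, not the crawler or database code used in the study: the function names `parse_memberships` and `developer_network` are hypothetical, and the records are the ten sample rows shown on the slide.

```python
from collections import defaultdict
from itertools import combinations

RECORDS = """PROJ|DEVELOPER
8001|dev378
8001|dev8975
8001|dev9972
8002|dev27650
8005|dev31351
8006|dev12509
8007|dev19395
8007|dev4622
8007|dev35611
8008|dev8975"""


def parse_memberships(text):
    """Map each project ID to the set of developer IDs listed for it."""
    members = defaultdict(set)
    for line in text.splitlines()[1:]:  # skip the header row
        project, developer = line.strip().split("|")
        members[project].add(developer)
    return members


def developer_network(members):
    """Link two developers whenever they share at least one project."""
    neighbors = defaultdict(set)
    for developers in members.values():
        for dev in developers:
            neighbors[dev]  # touch the key so solo developers become nodes
        for a, b in combinations(sorted(developers), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


members = parse_memberships(RECORDS)
network = developer_network(members)
for dev in sorted(network):
    print(dev, sorted(network[dev]))
```

In this sample, dev8975 appears in projects 8001 and 8008, which is the kind of developer that connects otherwise separate projects.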
Models of the F/OSS Social Network
(Alternative Hypotheses)
• General model features
– Agents are nodes on a graph (developers or projects)
– Behaviors: Create, join, abandon and idle
– Edges are relationships (joint project participation)
– Growth of network: random or types of preferential
attachment, formation of clusters
– Fitness
– Network attributes: diameter, average degree, degree
distribution, clustering coefficient
• Four specific models
– ER (random graph) - (1960)
– BA (preferential attachment) - (1999)
– BA ( + constant fitness) - (2001)
– BA ( + dynamic fitness) - (2003)
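The difference between the first two models can be sketched in a few lines. This is an illustration under simplifying assumptions, not the Swarm simulation from the study: it uses a plain one-mode graph rather than developers and projects, and the function names `er_graph`, `ba_graph` and `degrees` are hypothetical.

```python
import random


def er_graph(n, p, rng):
    """Erdős–Rényi G(n, p): each possible edge is present with probability p."""
    edges = set()
    for a in range(n):
        for b in range(a + 1, n):
            if rng.random() < p:
                edges.add((a, b))
    return edges


def ba_graph(n, m, rng):
    """Barabási–Albert growth: each new node links to m existing nodes,
    chosen with probability proportional to their current degree."""
    # Start from a clique of m + 1 nodes so every node has nonzero degree.
    edges = {(a, b) for a in range(m + 1) for b in range(a + 1, m + 1)}
    # Each node appears in `stubs` once per incident edge, so a uniform
    # draw from `stubs` is a degree-proportional draw over nodes.
    stubs = [node for edge in sorted(edges) for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in sorted(targets):
            edges.add((t, new))
            stubs.extend((t, new))
    return edges


def degrees(n, edges):
    """Degree of every node 0..n-1."""
    counts = [0] * n
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return counts


rng = random.Random(0)
er = er_graph(200, 0.02, rng)
ba = ba_graph(200, 2, rng)
print("largest ER degree:", max(degrees(200, er)))
print("largest BA degree:", max(degrees(200, ba)))
```

Preferential attachment tends to produce a few very high-degree hubs, which is what gives the BA model its power-law degree distribution; the ER graph has no such mechanism.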
F/OSS Developers - Collaboration Social Network
Developers are nodes / Projects are links
24 Developers
5 Projects
2 Linchpin Developers
1 Cluster
[Figure: developer collaboration network for Projects 6882, 7028, 7597, 9859 and 15850; nodes are labeled dev[NN]]
Computer Experiments
• Agent-based simulations
• Java programs using Swarm class library
– Validation (docking) exercises using Java/Repast
• Grow artificial SourceForges (Epstein & Axtell, 1996)
– Parameterized with observed data, e.g., developer behaviors
• Join rates
• New project additions
• Leave projects
– Evaluation of four models (hypotheses)
– Verification/validation
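The agent behaviors named in the models (create, join, abandon, idle) can be sketched as a simple loop. This is a toy illustration in Python, not the Java/Swarm code from the study. The function `simulate`, the one-developer-per-step arrival rule, the uniform choice of project to join, and the action weights are all assumptions; the study parameterized these from observed SourceForge data.

```python
import random

ACTIONS = ("create", "join", "abandon", "idle")


def simulate(steps, weights, rng):
    """Grow an artificial developer/project network for `steps` time steps.

    `weights` gives the relative chance of each action in ACTIONS; the
    values used below are placeholders, not rates measured from SourceForge.
    """
    projects = {}          # project id -> set of developer ids
    n_developers = 0
    next_project = 0
    for _ in range(steps):
        n_developers += 1  # assumed: one new developer agent arrives per step
        for dev in range(n_developers):
            action = rng.choices(ACTIONS, weights=weights)[0]
            mine = [p for p, devs in projects.items() if dev in devs]
            if action == "create":
                projects[next_project] = {dev}
                next_project += 1
            elif action == "join" and projects:
                # Uniform choice here; a BA-style model would weight this
                # draw by project size instead.
                projects[rng.choice(list(projects))].add(dev)
            elif action == "abandon" and mine:
                projects[rng.choice(mine)].discard(dev)
            # "idle" (or an action with no valid target) changes nothing
    return n_developers, projects


n_devs, projects = simulate(50, (0.05, 0.15, 0.05, 0.75), random.Random(0))
print(n_devs, "developers,", len(projects), "projects")
```

Swapping the uniform draw in the "join" branch for a different rule is how the four hypotheses would differ in a sketch like this.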
Four Cycles of Modeling &
Simulation
Modeling
(Hypothesis)
Social Network Models
ER => BA => BA+Fitness => BA+Dynamic Fitness
Agent-Based
Simulation
Observation
Analysis of
SourceForge
Data
Degree Distribution
Average Degree
Diameter
Clustering Coefficient
Cluster Size Distribution
(Experiment)
Grow Artificial
SourceForge
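Several of the attributes compared in each cycle can be computed from an adjacency map. The sketch below is illustrative only: the slides do not say which definition of the clustering coefficient was used, so the average of local coefficients is an assumption, and the four-node graph is made up for the example.

```python
from collections import deque
from itertools import combinations


def average_degree(adj):
    """Mean number of neighbors per node."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def clustering_coefficient(adj):
    """Mean over nodes of the fraction of neighbor pairs that are linked.

    Nodes with fewer than two neighbors contribute 0.
    """
    total = 0.0
    for nbrs in adj.values():
        pairs = list(combinations(sorted(nbrs), 2))
        if pairs:
            linked = sum(1 for a, b in pairs if b in adj[a])
            total += linked / len(pairs)
    return total / len(adj)


def eccentricity(adj, source):
    """Largest breadth-first distance from `source` to any reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return max(dist.values())


def diameter(adj):
    """Longest shortest path, measured within each connected component."""
    return max(eccentricity(adj, node) for node in adj)


# Made-up example: a triangle a-b-c with one extra node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(average_degree(adj), clustering_coefficient(adj), diameter(adj))
```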
ER model – degree distribution
• The degree distribution
is binomial, whereas
it follows a power
law in the
empirical data
• Fit fails
ER model - diameter
• Average degree decreases,
whereas it increases in
the empirical data
• Diameter increases,
whereas it decreases in
the empirical data
• Fit fails
ER model – clustering coefficient
• Clustering coefficient is
relatively low, around 0.4,
versus around 0.7 in the
empirical data
• Clustering coefficient
decreases, whereas it
increases in the empirical
data
• Fit fails
ER model – cluster distribution
• The cluster distribution in the
ER model also follows a power
law, with R² of 0.6667
(0.9953 without the major
cluster); R² in the empirical
data is 0.7457 (0.9797
without the major cluster)
• The actual distribution still
differs from the empirical data
• The later models (BA and
its extensions) behave
similarly
• Fit fails
BA model – degree distribution
• Power laws in degree
distribution, similar to
empirical data (+ for
simulated data and x for
empirical data).
• For developer distribution:
simulated data has R² of
0.9798 and empirical data has
R² of 0.9712.
– Fit succeeds
• For project distribution:
simulated data has R² of
0.6650 and empirical data has
R² of 0.9815.
– Fit fails
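The slides report R2 values for the power-law fits without saying how they were computed. One common approach, assumed here, is a least-squares line through the points on log-log axes. The function name `log_log_r_squared` is hypothetical and the data are synthetic.

```python
import math


def log_log_r_squared(xs, ys):
    """Slope and R^2 of the least-squares line through (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    syy = sum((y - mean_y) ** 2 for y in ly)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    # For a simple linear fit, R^2 equals the squared correlation.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, r_squared


# Synthetic data: an exact power law y = 1000 * x^-2.
degree = list(range(1, 11))
count = [1000.0 * d ** -2 for d in degree]
slope, r2 = log_log_r_squared(degree, count)
print(round(slope, 3), round(r2, 3))
```

An R2 near 1 means the points lie close to a straight line on log-log axes, which is the signature of a power law; a low value, like the project distribution in the BA model, means they do not.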
BA model – diameter and CC
• Small diameter and high
clustering coefficient like
empirical data
• Diameter and clustering
coefficient are both
decreasing like empirical
data
• Fit succeeds
BA model with constant fitness
• Power laws in degree distribution,
similar to empirical data (+ for
simulated data and x for empirical
data).
• For developer distribution:
simulated data has R² of 0.9742 and
empirical data has R² of 0.9712.
– Fit succeeds
• For project distribution: simulated
data has R² of 0.7253 and empirical
data has R² of 0.9815.
– Fit fails
• Diameter and CC are similar to
the simple BA model.
– Fit succeeds
Discovery: BA with dynamic fitness
• Problem with BA with constant fitness
– Intuition: Project fitness might change with time.
• Data mining observation: project “life cycle”
property - fitness generally decreases with time
• New model not in the literature
– Hypothesis: BA with dynamic fitness of projects
– Computer experiment
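The idea of a fitness that declines over a project's life cycle can be sketched as a weighted choice. The slides do not give the functional form, so the exponential decay, the `decay` parameter, and the names `fitness` and `choose_project` are assumptions made for illustration.

```python
import math
import random


def fitness(initial, age, decay):
    """Assumed form: fitness decays exponentially with project age.

    With decay = 0 this reduces to the constant-fitness model.
    """
    return initial * math.exp(-decay * age)


def choose_project(projects, now, decay, rng):
    """Pick a project with probability proportional to size * fitness."""
    weights = [
        p["size"] * fitness(p["fitness"], now - p["born"], decay)
        for p in projects
    ]
    return rng.choices(projects, weights=weights)[0]


# Two equally sized projects; the older one is less attractive.
projects = [
    {"id": "old", "size": 5, "fitness": 1.0, "born": 0},
    {"id": "new", "size": 5, "fitness": 1.0, "born": 9},
]
rng = random.Random(0)
picks = [choose_project(projects, 10, 0.3, rng)["id"] for _ in range(1000)]
print(picks.count("old"), picks.count("new"))
```

Under this assumption a large but aging project gradually loses its advantage over newer ones, which plain preferential attachment cannot represent.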
BA model with dynamic fitness
• Power laws in degree
distribution, similar to
empirical data (+ for
simulated data and x for
empirical data).
• For developer distribution:
simulated data has R² of
0.9695 and empirical data has
R² of 0.9712.
– Fit succeeds (as before)
• For project distribution:
simulated data has R² of
0.8051 and empirical data has
R² of 0.9815.
– Fit is better, but more work
is needed
Summary
• Why Agent-Based Modeling and Simulation?
– Can be used as components of the Scientific Method
– A research approach for studying socio-technical
systems
• Case study: F/OSS - Collaboration Social Networks
– SourceForge conceptual models: ER, BA, BA with
constant fitness and BA with dynamic fitness.
– Simulations
• Computer experiments that tested conceptual models
• Provided insight into the phenomenon under study and guided
data mining of collected observations
Discussion
• “The social sciences are, in fact, the ‘hard’ sciences”,
Herbert Simon (1987)
• Computational social science: agent-based modeling
and simulation
• Kuhn’s periods of “Normal Science” punctuated by
“Paradigm shifts”
• Karl Popper’s “theory-testing through falsification”
• Relevant literature on the role of simulation in the
process of scientific discovery
Thank you