Mining Open Source Software(OSS) Data Using Association Rules


Mining SourceForge Data to Discover
Models of Open Source Software
(OSS) Project Performance
Joseph Davis, Bavani Arunasalam, Simon
Poon, Sanjay Chawla
Knowledge Management Research Group
School of Information Technologies
The University of Sydney
'Data from the Field' EII
Workshop, 24 May 2007
Outline
• Motivation for this project
• Open Source Software (OSS) Development
• SourceForge data repository
• Data Mining Possibilities
• Association Rule Mining and Association
Rules Network (ARN)
• Application of ARNs to OSS data
• Theory Building using Data Mining
• Conclusions and Future Research
Motivation
• Steady success of Open Source
Software (OSS): Linux, Apache, Samba, Python,
MySQL
• KM group is trying to study a range of OSS-related
questions using theoretical and data-driven
approaches
• Availability of extensive data on most aspects of
OSS projects
• Question: What are the key factors that can
explain ‘success’ in OSS projects?
Open Source Software
Development
• Non-proprietary and perceived to be socially
beneficial model of software development
• OS software in the public domain; source code freely
available for modification and distribution
• Nearly 200,000 projects in progress, each involving
dozens to hundreds of (geographically distributed)
developers who coordinate their work through the
internet
• Increasingly viewed as a viable model for building
robust, secure, and scalable software: the commons-based
peer production model/distributed innovation.
OSS Trends
• Growing acceptance of OS software in
organizations
• Increasing participation by large software
companies such as IBM, Sun, HP etc. in
OSS development
• Increasingly viable software distribution
business models
• Large and growing communities of OSS
developers and users
Untested Claims regarding OSS
development
• Good software evolves when a dedicated
community (of developers and programmers)
works cooperatively, in comparison with the
more traditional hierarchical and closed
model (OSI, 2001): the ‘cathedral’ and the
‘bazaar’ analogy.
• Quality, speed, portability, and scalability of
the resulting software.
• Taming complexity, fewer bugs (many
eyeballs phenomenon)
• Offers a viable model for the emerging ‘virtual
organisations’.
Open Research Questions
• How do we discover crucial relationships
that characterise successful and
unsuccessful OSS projects?
• How can we develop models (specifying
hypotheses) of the critical determinants of
OSS project performance?
• What constitutes good performance in
OSS development?
Field Data for OSS Research
• SourceForge.net is the largest OSS
development website.
• Besides hosting, SourceForge.net
provides services for version control, bug tracking, etc.
• Nearly 200,000 projects grouped under 17
categories; over 2 million users.
• Great source of ‘field’ data to research
OSS development.
Problems with SourceForge
• Number of ongoing OSS projects is
misleading. Most of the overall activity
is accounted for by fewer than 10% of
the projects (Pareto distributions)
• Need for purposeful sampling and careful
data cleaning: extreme variations across
projects and noise
Problem Definition
• GIVEN: OSS Data downloaded from
SourceForge.net
• OBJECTIVE: Find patterns which
characterize a high performing OSS
project
• CONSTRAINTS: The surrogate variable for
performance is the number of downloads.
Why not statistical models?
• Attributes were of heterogeneous
types: numerical and discrete
• Data plagued with missing values
• Downloads followed a Pareto distribution
– Most projects have few downloads, but there is a long tail
– Ex: median download count is 70, but it can be up to
600,000
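The ARN slides later in the deck label each variable as High or Low, but the deck does not say how the cut-offs were chosen. The following is a minimal sketch, assuming a simple rank-based cut-off (the `discretise` function, the 0.75 quantile, and the download counts are all hypothetical), of how a skewed numerical attribute with missing values could be turned into a discrete item:

```python
import statistics

def discretise(values, high_quantile=0.75):
    """Label each value 'High' or 'Low'; missing values (None) stay None.

    The cut-off is the value found at the given rank position among the
    non-missing values. This is an illustrative assumption, not the
    binning used in the study.
    """
    present = sorted(v for v in values if v is not None)
    cut = present[min(len(present) - 1, int(high_quantile * len(present)))]
    return [None if v is None else ("High" if v >= cut else "Low")
            for v in values]

# Hypothetical download counts: heavily skewed, with one missing value.
downloads = [12, 70, 35, None, 600000, 90, 48, 5200]
labels = discretise(downloads)
median = statistics.median(v for v in downloads if v is not None)

print(labels)
print("median:", median)
```

Because the labels depend only on rank, the extreme value in the tail does not distort the cut-off the way it would distort a mean-based threshold.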
Association Rules
• Association rule mining:
– Finding frequent patterns, associations,
correlations among sets of items or objects in
transaction databases, relational databases,
and other information repositories.
• Applications:
– Market basket data analysis, cross-marketing,
catalog design, loss-leader analysis, clustering,
classification, etc.
• Examples.
– Rule form: “Body → Head” for a given
[support, confidence].
– buys(x, “diapers”) → buys(x, “beers”) [1 %, …]
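As a minimal sketch of what the [support, confidence] pair attached to a rule measures, the snippet below computes both for the diapers and beers rule over a handful of made-up baskets. The basket contents and the helper names `support` and `confidence` are hypothetical and do not come from the slides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, body, head):
    """Estimate of P(head | body): support(body and head) / support(body)."""
    both = support(transactions, set(body) | set(head))
    return both / support(transactions, body)

# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"diapers", "beers", "milk"},
    {"diapers", "beers"},
    {"diapers", "bread"},
    {"milk", "bread"},
    {"beers"},
]

s = support(baskets, {"diapers", "beers"})
c = confidence(baskets, {"diapers"}, {"beers"})
print(f"diapers -> beers [support={s:.0%}, confidence={c:.0%}]")
```

Here two of the five baskets contain both items, and two of the three baskets with diapers also contain beers.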
Typical Association Rule Mining
Approaches
• Discover robust association rules that are
non-obvious and actionable
• Discover frequent item sets as features
that serve as discriminators for
classification and prediction (based on a
class variable)
• Our approach seeks to discover a graph
structure that characterises performance
based on the mined association rules.
Association Rules
• Given: (1) database of transactions (OSS
projects), (2) each transaction is a list of
items (project variable values)
• Find: all rules that correlate the presence
of one set of items with that of another set
of items
– E.g., 72% of OSS projects whose bug-fixing
activity level is high and whose number of
developers is high also have a high number of downloads:
(bug fixing activity = ‘high’, number of developers = ‘high’)
→ (number of downloads = ‘high’)
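The problem statement above can be sketched in a few lines by treating each project as a set of (variable, value) items and enumerating candidate rule bodies for one fixed consequent. This is a brute-force illustration under stated assumptions, not the mining algorithm used in the study: the function `mine_rules`, the thresholds, and the five projects are all hypothetical.

```python
from itertools import combinations

def mine_rules(projects, target, min_support=0.3, min_confidence=0.7,
               max_body=2):
    """Return rules (body, target, support, confidence) that pass both
    thresholds, trying every body of up to max_body items."""
    items = sorted({item for p in projects for item in p} - {target})
    n = len(projects)
    rules = []
    for size in range(1, max_body + 1):
        for body in combinations(items, size):
            body_set = set(body)
            body_count = sum(body_set <= p for p in projects)
            if body_count == 0:
                continue  # body never occurs, confidence is undefined
            both_count = sum(body_set <= p and target in p for p in projects)
            sup, conf = both_count / n, both_count / body_count
            if sup >= min_support and conf >= min_confidence:
                rules.append((body, target, sup, conf))
    return rules

# Hypothetical projects, each a set of (variable, value) items.
projects = [
    {("bugs_fixed", "high"), ("developers", "high"), ("downloads", "high")},
    {("bugs_fixed", "high"), ("developers", "high"), ("downloads", "high")},
    {("bugs_fixed", "high"), ("developers", "low"), ("downloads", "low")},
    {("bugs_fixed", "low"), ("developers", "high"), ("downloads", "low")},
    {("bugs_fixed", "low"), ("developers", "low"), ("downloads", "low")},
]

rules = mine_rules(projects, target=("downloads", "high"))
for body, head, sup, conf in rules:
    print(body, "->", head, f"[support={sup:.0%}, confidence={conf:.0%}]")
```

On this toy data neither variable alone passes the confidence threshold, but the two together do, which is the kind of joint pattern the example rule on the slide describes. Exhaustive enumeration grows quickly with the number of items, so a real analysis would use a frequent-itemset algorithm instead.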
Problems with Association Rule Mining
• Too many (irrelevant/redundant) rules generated
• Measures of “interestingness” still primitive and
not general
• Our solution: a pruning strategy that creates an
Association Rules Network in a recursive
manner.
Related Work:
S. Chawla, J. Davis, G. Pandey, "On Local
Pruning of Association Rules Using Directed
Hypergraphs", IEEE Conference on Data
Engineering (ICDE’04)
Association Rules Network
• Consider a binary
table R(A,B,C,D,E,F,G)
• {B=1, C=1} -> {A=1}
• {D=1} -> {A=1}
• {F=0} -> {B=1}
• {F=0, E=1} -> {C=1}
• {E=1, G=0} -> {D=1}
• {A=1, G=1} -> {E=1}
Fix a consequent {A=1}
[Figure: hypergraph over the nodes B=1, F=0, C=1, A=1, E=1, D=1, G=0]
ARN Definition
An ARN (R, z) is a weighted directed
hypergraph G = (V ∪ {z}, E), where z is a
distinguished sink item (node) and R is the
set of association rules, such that
• Each hyperedge in E corresponds to a rule in R
whose consequent is a singleton,
• There is a hyperedge which corresponds
to a rule r whose consequent is the single
item z.
ARN Definition (cont.)
• The distinguished vertex z is reachable
from any other vertex in G.
• Any vertex p not equal to z is not
reachable from z.
• The weight of each edge corresponds to the
confidence of the rule that it
encapsulates.
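The definition can be illustrated on the six rules of the binary-table example with sink z = {A=1}. The sketch below is a simplified construction written for this transcript, not the authors' published algorithm: it works backwards from the sink, keeps a rule only when all of its body items lie further from the sink than its head, and omits edge weights because the slide gives no confidences for this example. The function name `build_arn` and the level bookkeeping are assumptions.

```python
from collections import deque

def build_arn(rules, sink):
    """Collect hyperedges (body, head) that lead towards the sink.

    level[item] is the distance from the sink at which an item was first
    reached. Keeping only rules whose body items sit at a greater level
    than the head means every edge points towards the sink, so the
    result has no cycles and nothing is reachable from the sink.
    """
    level = {sink: 0}
    edges = []
    queue = deque([sink])
    while queue:
        target = queue.popleft()
        for body, head in rules:
            if head != target:
                continue
            # Items not seen yet will be placed one level beyond the target.
            if any(level.get(item, level[target] + 1) <= level[target]
                   for item in body):
                continue
            edges.append((body, head))
            for item in body:
                if item not in level:
                    level[item] = level[target] + 1
                    queue.append(item)
    return edges, level

# The six rules from the example slide, as (body, head) pairs.
rules = [
    (("B=1", "C=1"), "A=1"),
    (("D=1",), "A=1"),
    (("F=0",), "B=1"),
    (("F=0", "E=1"), "C=1"),
    (("E=1", "G=0"), "D=1"),
    (("A=1", "G=1"), "E=1"),
]

edges, level = build_arn(rules, sink="A=1")
for body, head in edges:
    print(set(body), "->", head)
```

In this sketch the rule {A=1, G=1} -> {E=1} is dropped because its body contains the sink, which would make E=1 reachable from z. The seven remaining nodes match those drawn in the example figure.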
Sampling
• Results based on a sample of 2301
‘stable’ or ‘production’ projects which were
initiated in the second half of 1999.
ARN for High Download
[Figure: ARN with sink node #Download = High.
Nodes: #Support Request = High, #Bugs Fixed = High, #Patches Completed = High,
#CVS Committed = High, #Bugs Found = High, #Forum Messages = High,
#Public Forums = High, # Administrators = High, # Developers = High, OS = POSIX.
Edge confidences shown: 78.7%, 90%, 73.8%, 55.3%, 68.4%, 93.3%, 67.9%.]
ARN for Low Download
[Figure: ARN with sink node #Download = Low.
Nodes: #Task Completed = Low, #Support Completed = Low, #Support Requested = Low,
#Public Forum = Low, Environment = Web based, #Bugs Fixed = Low,
#Forum Messages = Low, # Developers = Low, # OS = 1, #Surveys = Low,
#CVS Committed = Low, # Administrators = Low, # Patches = Low, #Bugs Found = Low,
# Mailing Lists = Low, # Environments = 1.
Edge confidences shown: 95.3%, 77.9%, 60.1%, 92.1%.]
Resulting Network
[Figure: resulting network over the nodes #Bugs Fixed, #Bugs Found,
#CVS Committed, #Public Forum = Low, #Forum Messages, #Download,
# Developers, #Administrators.]
Critical Factors
• Coding and bug fixing activity levels
• Communication intensity
• Core development team strength
Validation with Factor
Analysis (FA)
• Independently applied FA.
• Factors are mutually orthogonal variables
which are linear combinations of subsets
of original variables.
• The factor structures were generally consistent
with the ARN results.
Related Research Projects
• Temporal analysis of OSS project
evolution
• Studies of OSS communities
• Analysis of OS software code and
community co-evolution (Samba)
• Study of open source software
deployment in organisations.
Conclusion
• Need to understand the key drivers for
OSS beyond experience-based intuition
and isolated case studies
• Association Rules Networks (ARNs) give
some insight into the process
• These insights are consistent with results
from Software Engineering
• Factor Analysis as a form of validation