Transcript FLOSS '07

Do Programming Languages Affect Productivity?
A Case Study Using Data from Open Source
Projects
Daniel Delorey, Dr. Charles D. Knutson, Scott Chun
SEQUIOA Lab
Brigham Young University
Motivation
“Productivity seems constant in terms of
elementary statements, a conclusion that is
reasonable in terms of the thought a
statement requires and the errors it may
include.”
(Fred Brooks, The Mythical Man-Month, p. 96)
Testing the Assertion
• Based on data gathered from the CVS repositories of 9,999 open source projects
• All “Production” phase projects hosted on SourceForge.net as of August 2006
• Test the assertion for the ten most popular languages used in these projects
• Popularity is defined in terms of:
  • project count
  • author count
  • file count
  • revision count
  • lines-of-code count
Data Collection
• Data collected using cvs2mysql
  • Set of cross-platform Python scripts
  • Extract data from CVS repositories
  • Write data to MySQL 5.0 import scripts
    • Individual script per repository
    • Structured so that multiple scripts can be imported into the same database
  • All import scripts are publicly available
• Data set metrics
  • More than 20 years of history
    • Heavily skewed toward the last six years
  • More than 7M files
  • More than 25M revisions
  • More than 24K developers
Language Popularity
Popularity Rankings for the ten most popular languages

Language    Project  Author  File  Revision  LOC   Average
            Rank     Rank    Rank  Rank      Rank  Rank
C           1        1       2     2         1     1
Java        2        2       1     1         2     2
C++         4        3       4     4         3     4
PHP         5        4       3     3         4     4
Python      7        7       5     5         5     6
Perl        3        5       9     9         6     6
JavaScript  6        6       6     8         10    7
C#          9        9       7     6         7     8
Pascal      8        10      8     7         8     8
Tcl         11       8       10    10        9     10
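The Average Rank column appears to be the mean of the five per-metric ranks, rounded to the nearest integer. That rule is inferred from the table values and is not stated on the slides. A minimal sketch that checks the inference against the rows above:

```python
# Ranks copied from the popularity table, in the order:
# project, author, file, revision, lines-of-code.
RANKS = {
    "C":          (1, 1, 2, 2, 1),
    "Java":       (2, 2, 1, 1, 2),
    "C++":        (4, 3, 4, 4, 3),
    "PHP":        (5, 4, 3, 3, 4),
    "Python":     (7, 7, 5, 5, 5),
    "Perl":       (3, 5, 9, 9, 6),
    "JavaScript": (6, 6, 6, 8, 10),
    "C#":         (9, 9, 7, 6, 7),
    "Pascal":     (8, 10, 8, 7, 8),
    "Tcl":        (11, 8, 10, 10, 9),
}


def average_rank(ranks):
    """Mean of the five ranks, rounded to the nearest integer (assumed rule)."""
    return round(sum(ranks) / len(ranks))


AVERAGE = {lang: average_rank(r) for lang, r in RANKS.items()}

if __name__ == "__main__":
    for lang, avg in AVERAGE.items():
        print(f"{lang:<12}{avg}")
```

Under this assumption the computed values match the Average Rank column for all ten languages, including the ties (C++ and PHP, Python and Perl, C# and Pascal).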
Our Methods
• Multiple linear regression
  • Begin by developing a rich model which excludes programming language as an explanatory factor
  • Add programming language as a factor
  • Test the significance of the programming language effect
• We perform all pair-wise language comparisons using a Tukey-Kramer Honest Significant Difference adjustment for multiple comparisons
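One common way to test whether an added factor matters is a partial F-test between the model without the factor and the model with it. The sketch below illustrates that idea on a tiny invented data set; the slides do not say which test statistic was used, and the single covariate (`months`), the response values, and the three languages are placeholders rather than the study's data. The Tukey-Kramer step is not covered here.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(X, y))


def partial_f(X_reduced, X_full, y):
    """F statistic for the extra columns in X_full over X_reduced."""
    n, p_red, p_full = len(y), len(X_reduced[0]), len(X_full[0])
    rss_red, rss_full = rss(X_reduced, y), rss(X_full, y)
    return ((rss_red - rss_full) / (p_full - p_red)) / (rss_full / (n - p_full))


DUMMIES = ("Java", "Perl")  # dummy-coded against the baseline "C"

# (language, months, lines written): invented values for illustration only
ROWS = [
    ("C", 1, 118), ("C", 2, 117), ("C", 3, 133), ("C", 4, 138),
    ("Java", 1, 112), ("Java", 2, 119), ("Java", 3, 128), ("Java", 4, 143),
    ("Perl", 1, 161), ("Perl", 2, 168), ("Perl", 3, 182), ("Perl", 4, 189),
]


def design(rows, with_language):
    """Design matrix: intercept, months, and optionally language dummies."""
    X = []
    for lang, months, _ in rows:
        row = [1.0, float(months)]
        if with_language:
            row += [1.0 if lang == d else 0.0 for d in DUMMIES]
        X.append(row)
    return X


y = [float(r[2]) for r in ROWS]
F_STAT = partial_f(design(ROWS, False), design(ROWS, True), y)
```

A large `F_STAT` relative to the F distribution with (2, 8) degrees of freedom would indicate that language explains variation the reduced model does not. In practice a statistics package would supply the p-value.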
Our Goal
• Test the assertion of constant programmer productivity
• Not to develop a predictive or an explanatory model
• Purpose of the model is to control the variation in our data before testing the significance of the programming language factor
Threats to Validity
• Observational Data
  • Generalization and inference of cause-and-effect not necessarily appropriate
• Underlying assumptions
  • All commits represent work performed by a single author in a single year
  • Constant average time commitment across languages
  • Constant average demand across languages
Dealing with Violated Assumptions
• Six ways assumptions may be violated
  • Repository Migration
  • Dead-file Restoration
  • Multi-project Files
  • Gatekeepers
  • Batch Commits
  • Automatic Code Generation
• Rudimentary Solutions
  • Exclude initial revisions
  • Exclude revisions that immediately follow dead revisions
  • Exclude projects where a single author committed more than 80,000 lines of code in a single year
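The three exclusion rules can be sketched as a single filter over revision records. The record layout (`project`, `author`, `year`, `lines`, `is_initial`, `follows_dead`) is hypothetical and is not the cvs2mysql schema. The sketch also assumes the 80,000-line total is computed over all of an author's revisions before the other exclusions are applied, which the slides do not specify.

```python
from collections import defaultdict

MAX_LINES_PER_AUTHOR_YEAR = 80_000  # threshold given on the slide


def filter_revisions(revisions):
    """Apply the three exclusion rules to a list of revision dicts."""
    # Total lines committed per (project, author, year).
    totals = defaultdict(int)
    for rev in revisions:
        totals[(rev["project"], rev["author"], rev["year"])] += rev["lines"]

    # Projects in which any single author exceeds the yearly threshold.
    suspect_projects = {
        project
        for (project, _author, _year), total in totals.items()
        if total > MAX_LINES_PER_AUTHOR_YEAR
    }

    return [
        rev for rev in revisions
        if rev["project"] not in suspect_projects
        and not rev["is_initial"]      # rule 1: drop initial revisions
        and not rev["follows_dead"]    # rule 2: drop restorations of dead files
    ]


# Invented sample records, for illustration only.
SAMPLE = [
    {"project": "alpha", "author": "ann", "year": 2005, "lines": 120,
     "is_initial": True, "follows_dead": False},
    {"project": "alpha", "author": "ann", "year": 2005, "lines": 40,
     "is_initial": False, "follows_dead": False},
    {"project": "alpha", "author": "bob", "year": 2005, "lines": 15,
     "is_initial": False, "follows_dead": True},
    {"project": "beta", "author": "cat", "year": 2006, "lines": 90_000,
     "is_initial": False, "follows_dead": False},
    {"project": "beta", "author": "dan", "year": 2006, "lines": 10,
     "is_initial": False, "follows_dead": False},
]

KEPT = filter_revisions(SAMPLE)
```

In this sample only the second record survives: the first is an initial revision, the third follows a dead revision, and project `beta` is dropped entirely because one author exceeds the threshold.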
Model Development
• We begin with 25 potential factors
• Iteratively remove factors to obtain a model that
  • is parsimonious
  • explains a sufficient amount of the variation
• Use the following criteria to remove factors
  • Variance Inflation Factor
  • Correlation
  • Practical Significance
  • Cp Statistic
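Three of these criteria have standard textbook definitions, sketched below. The slides give no thresholds or formulas, so the function names, the toy numbers, and the two-predictor simplification for the variance inflation factor are assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def vif(r_squared):
    """Variance inflation factor, given the R^2 from regressing one
    predictor on all the other predictors."""
    return 1.0 / (1.0 - r_squared)


def mallows_cp(rss_subset, sigma2_full, n, p):
    """Mallows' Cp for a subset model with p parameters (intercept included).

    rss_subset  -- residual sum of squares of the subset model
    sigma2_full -- error variance estimated from the full model
    n           -- number of observations
    """
    return rss_subset / sigma2_full - n + 2 * p


# Toy predictors, invented for illustration. With exactly two predictors,
# the R^2 of one regressed on the other is the squared correlation.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 5, 4, 5]
VIF_X1 = vif(pearson(x1, x2) ** 2)

# A subset model whose Cp is close to p shows little evidence of bias.
CP = mallows_cp(rss_subset=160.0, sigma2_full=10.0, n=20, p=4)
```

With more than two predictors, the R^2 passed to `vif` would come from a full auxiliary regression rather than a single pairwise correlation.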
Potential Factors
Language Related Factors Per Year
  For the Current Year
    • Months since first recorded use
    • Active projects using this language
    • Active authors using this language
    • Current files written in this language
    • Total number of lines written in this language
  Aggregated Over Prior Years
    • Total projects having used this language
    • Total authors having used this language
    • Total files written in this language
    • Total number of lines written in this language

Author Related Factors Per Year
  For the Current Year
    • Months since first contribution
    • Active projects with contributions
    • Number of programming languages used
    • Current files edited
    • Total number of lines written
  Aggregated Over Prior Years
    • Total projects with contributions
    • Total number of programming languages used
    • Total files edited by this author
    • Total number of lines written by this author

Language Specific Author Related Factors Per Year
  For the Current Year
    • Months since first contribution
    • Active projects with contributions
    • Current files edited
    • Total number of lines written
  Aggregated Over Prior Years
    • Total projects with contributions
    • Total files edited by this author

Temporal Factor
    • Calendar Year
Removed due to…
High Variance Inflation Factor
Practically Insignificant Coefficients
Low Correlation with the Dependent Variable
Variable Selection Using the Cp Statistic
Results
Statistical Significance of Pair-wise Language Comparisons

            JavaScript  Perl  Tcl   Python  PHP   Java  C     C++   C#
Perl        0.46
Tcl         0.60        1.00
Python      0.00        0.00  0.76
PHP         0.00        0.00  0.08  0.72
Java        0.00        0.00  0.02  0.18    1.00
C           0.00        0.00  0.00  0.01    0.53  0.97
C++         0.00        0.00  0.00  0.00    0.01  0.07  0.59
C#          0.00        0.00  0.00  0.02    0.26  0.50  0.83  1.00
Pascal      0.00        0.00  0.00  0.00    0.10  0.26  0.60  0.99  1.00
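The pairwise values can be queried programmatically to check whether a set of languages shows any internal difference. The sketch below encodes the values as printed (rounded to two decimals, so 0.00 means a value below 0.005) and assumes a significance level of 0.05, which the slides do not state. The example groupings passed to `homogeneous` are chosen here for illustration and are not named in the slides.

```python
from itertools import combinations

COLUMNS = ["JavaScript", "Perl", "Tcl", "Python", "PHP",
           "Java", "C", "C++", "C#"]

# Lower-triangular rows of the pairwise table, left to right.
ROWS = {
    "Perl":   [0.46],
    "Tcl":    [0.60, 1.00],
    "Python": [0.00, 0.00, 0.76],
    "PHP":    [0.00, 0.00, 0.08, 0.72],
    "Java":   [0.00, 0.00, 0.02, 0.18, 1.00],
    "C":      [0.00, 0.00, 0.00, 0.01, 0.53, 0.97],
    "C++":    [0.00, 0.00, 0.00, 0.00, 0.01, 0.07, 0.59],
    "C#":     [0.00, 0.00, 0.00, 0.02, 0.26, 0.50, 0.83, 1.00],
    "Pascal": [0.00, 0.00, 0.00, 0.00, 0.10, 0.26, 0.60, 0.99, 1.00],
}

# Map each unordered language pair to its adjusted p-value.
P_VALUE = {
    frozenset((row, col)): p
    for row, values in ROWS.items()
    for col, p in zip(COLUMNS, values)
}


def homogeneous(group, alpha=0.05):
    """True if no pair within the group differs significantly at alpha."""
    return all(P_VALUE[frozenset(pair)] >= alpha
               for pair in combinations(group, 2))


if __name__ == "__main__":
    for group in (["JavaScript", "Perl", "Tcl"],
                  ["Python", "PHP", "Java"],
                  ["C", "C++", "C#", "Pascal"]):
        print(group, homogeneous(group))
```

All 45 language pairs are present in `P_VALUE`, so any subset of the ten languages can be passed to `homogeneous`.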
Conclusions
• Three groups with no statistically significant internal differences but statistically significant external differences
• An approximate progression from high-level, multi-paradigm languages to low-level imperative/object-oriented languages
• The assertion does appear to hold true for the languages most similar to those Brooks was considering
Future Work
• Additional studies of productivity
• Relate lines of code to function points or pattern points
• Develop model to provide explanation, not just hypothesis testing
• Consider additional explanatory factors
• Studies to determine whether these statistical significances translate into practical significances
Questions?