Transcript FLOSS '07
Do Programming Languages Affect Productivity?
A Case Study Using Data from Open Source
Projects
Daniel Delorey, Dr. Charles D. Knutson, Scott Chun
SEQuOIA Lab
Brigham Young University
Motivation
“Productivity seems constant in terms of
elementary statements, a conclusion that is
reasonable in terms of the thought a
statement requires and the errors it may
include.”
(Fred Brooks, The Mythical Man Month, p. 96)
Testing the Assertion
Based on data gathered from the CVS repositories
of 9,999 open source projects
All “Production” phase projects hosted on
SourceForge.net as of August, 2006
Test the assertion for the ten most popular
languages used in these projects
Popularity is defined in terms of:
project count
author count
file count
revision count
lines-of-code count
Data Collection
Data collected using cvs2mysql
Set of cross platform Python scripts
Extract data from CVS repositories
Write data to MySQL 5.0 import scripts
Individual script per repository
Structured so that multiple scripts can be imported into the
same database
All import scripts are publicly available
Data set metrics
More than 20 years of history
Heavily skewed toward the last six years
More than 7M files
More than 25M revisions
More than 24K developers
Language Popularity
Popularity Rankings for the ten most popular languages
Language    Project  Author  File  Revision  LOC   Average
            Rank     Rank    Rank  Rank      Rank  Rank
C           1        1       2     2         1     1
Java        2        2       1     1         2     2
C++         4        3       4     4         3     4
PHP         5        4       3     3         4     4
Python      7        7       5     5         5     6
Perl        3        5       9     9         6     6
JavaScript  6        6       6     8         10    7
C#          9        9       7     6         7     8
Pascal      8        10      8     7         8     8
Tcl         11       8       10    10        9     10
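The Average Rank column appears to be the mean of the five rank columns, rounded to the nearest integer. That reading is inferred from the numbers and is not stated on the slide. The short sketch below checks it against every row; the names RANKS, AVERAGE_RANK and rounded_mean_rank are illustrative only.

```python
# Rank columns copied from the table: project, author, file, revision, LOC.
RANKS = {
    "C": (1, 1, 2, 2, 1),
    "Java": (2, 2, 1, 1, 2),
    "C++": (4, 3, 4, 4, 3),
    "PHP": (5, 4, 3, 3, 4),
    "Python": (7, 7, 5, 5, 5),
    "Perl": (3, 5, 9, 9, 6),
    "JavaScript": (6, 6, 6, 8, 10),
    "C#": (9, 9, 7, 6, 7),
    "Pascal": (8, 10, 8, 7, 8),
    "Tcl": (11, 8, 10, 10, 9),
}

# Average Rank column as printed on the slide.
AVERAGE_RANK = {
    "C": 1, "Java": 2, "C++": 4, "PHP": 4, "Python": 6,
    "Perl": 6, "JavaScript": 7, "C#": 8, "Pascal": 8, "Tcl": 10,
}


def rounded_mean_rank(ranks):
    """Mean of the individual ranks, rounded to the nearest integer."""
    return round(sum(ranks) / len(ranks))


for language, ranks in RANKS.items():
    mean = sum(ranks) / len(ranks)
    print(f"{language:<10} mean={mean:.1f} rounded={rounded_mean_rank(ranks)} "
          f"slide={AVERAGE_RANK[language]}")
```

Under this reading, ties in the Average Rank column (C++ and PHP, Python and Perl, C# and Pascal) come from rounding means that differ by less than one rank.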
Our Methods
Multiple linear regression
Begin by developing a rich model which excludes
programming language as an explanatory factor
Add programming language as a factor
Test the significance of the programming language
effect
We perform all pair-wise language comparisons using a Tukey-Kramer Honest Significant Difference adjustment for multiple
comparisons
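One common way to test whether an added factor improves a linear model is the extra-sum-of-squares (partial) F-test on the nested models. The slides do not say which test was used, so the sketch below is an assumption about the general technique, not the authors' procedure. It fits both models by ordinary least squares in plain Python; solve, rss and partial_f are hypothetical helper names, and the data are toy values.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on the columns of X."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(X, y))


def partial_f(X_reduced, X_full, y):
    """F statistic for the columns that X_full adds to X_reduced."""
    n = len(y)
    p_reduced, p_full = len(X_reduced[0]), len(X_full[0])
    rss_reduced, rss_full = rss(X_reduced, y), rss(X_full, y)
    return ((rss_reduced - rss_full) / (p_full - p_reduced)) / (rss_full / (n - p_full))


# Toy data: the reduced model has an intercept only; the full model adds a
# 0/1 dummy column that stands in for "programming language".
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
X_reduced = [[1.0] for _ in y]
X_full = [[1.0, 0.0]] * 3 + [[1.0, 1.0]] * 3
print(partial_f(X_reduced, X_full, y))
```

The statistic is compared against an F distribution with (added columns, n minus full-model columns) degrees of freedom. A factor with k levels contributes k - 1 dummy columns.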
Our Goal
Test the assertion of constant programmer
productivity
Not to develop a predictive or an explanatory
model
Purpose of the model is to control the variation
in our data before testing the significance of the
programming language factor
Threats to Validity
Observational Data
Generalization and inference of cause-and-effect not necessarily appropriate
Underlying assumptions
All commits represent work performed by a
single author in a single year
Constant average time commitment across
languages
Constant average demand across languages
Dealing with Violated Assumptions
Six ways assumptions may be violated
Repository Migration
Dead-file Restoration
Multi-project Files
Gatekeepers
Batch Commits
Automatic Code Generation
Rudimentary Solutions
Exclude initial revisions
Exclude revisions that immediately follow dead revisions
Exclude projects where a single author committed more
than 80,000 lines of code in a single year
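The three exclusion rules above can be sketched as a filter over revision records. The Revision layout, its field names, and the choice to compute the yearly total after the first two rules are assumptions made for illustration; the slides do not show the actual schema or the order in which the rules were applied.

```python
from collections import defaultdict
from dataclasses import dataclass

LINE_LIMIT = 80_000  # threshold quoted on the slide


@dataclass(frozen=True)
class Revision:
    # Hypothetical record layout; the real database schema is not shown.
    project: str
    author: str
    year: int
    path: str
    seq: int            # position in the file's history, 0 = initial revision
    lines_added: int
    dead: bool = False  # True if this revision marks the file as removed


def apply_exclusions(revisions):
    """Return the revisions that survive the three exclusion rules."""
    by_file = defaultdict(list)
    for rev in revisions:
        by_file[(rev.project, rev.path)].append(rev)

    kept = []
    for history in by_file.values():
        history.sort(key=lambda rev: rev.seq)
        # Pairing each revision with its predecessor skips the initial
        # revision (rule 1); the check drops revisions that immediately
        # follow a dead revision (rule 2).
        for prev, rev in zip(history, history[1:]):
            if not prev.dead:
                kept.append(rev)

    # Rule 3: drop whole projects where one author exceeds the yearly limit.
    # Assumption: the total counts only revisions that survive rules 1 and 2.
    yearly = defaultdict(int)
    for rev in kept:
        yearly[(rev.project, rev.author, rev.year)] += rev.lines_added
    excluded = {project for (project, _, _), total in yearly.items()
                if total > LINE_LIMIT}
    return [rev for rev in kept if rev.project not in excluded]
```

Rules 1 and 2 target repository migration and dead-file restoration, where a single commit can carry lines its committer did not write. Rule 3 is a coarse guard against gatekeepers, batch commits and generated code.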
Model Development
We begin with 25 potential factors
Iteratively remove factors to obtain a model that
is parsimonious
explains a sufficient amount of the variation
Use the following criteria to remove factors
Variance Inflation Factor
Correlation
Practical Significance
Cp Statistic
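Two of these criteria have standard textbook definitions, sketched below. The slides give no cut-off values, so none are assumed here; the function and parameter names are illustrative.

```python
def variance_inflation_factor(r_squared):
    """VIF for one factor, where r_squared is the R^2 obtained by
    regressing that factor on all the other factors in the model."""
    return 1.0 / (1.0 - r_squared)


def mallows_cp(sse_subset, mse_full, n, p):
    """Mallows' Cp for a subset model.

    sse_subset: residual sum of squares of the subset model
    mse_full:   residual mean square of the full model
    n:          number of observations
    p:          number of coefficients in the subset model, intercept included
    """
    return sse_subset / mse_full - n + 2 * p


print(variance_inflation_factor(0.9))
print(mallows_cp(sse_subset=52.0, mse_full=1.0, n=50, p=4))
```

A large VIF signals that a factor is nearly a linear combination of the others. A subset model whose Cp is close to p shows little evidence of bias from the factors left out, and for the full model Cp equals p by construction.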
Potential Factors
Language Related Factors Per Year
  For the Current Year
    Months since first recorded use
    Active projects using this language
    Active authors using this language
    Current files written in this language
    Total number of lines written in this language
  Aggregated Over Prior Years
    Total projects having used this language
    Total authors having used this language
    Total files written in this language
    Total number of lines written in this language
Author Related Factors Per Year
  For the Current Year
    Months since first contribution
    Active projects with contributions
    Number of programming languages used
    Current files edited
    Total number of lines written
  Aggregated Over Prior Years
    Total projects with contributions
    Total number of programming languages used
    Total files edited by this author
    Total number of lines written by this author
Language Specific Author Related Factors Per Year
  For the Current Year
    Months since first contribution
    Active projects with contributions
    Current files edited
    Total number of lines written
  Aggregated Over Prior Years
    Total projects with contributions
    Total files edited by this author
Temporal Factor
  Calendar Year
Removed due to…
High Variance Inflation Factor
Practically Insignificant Coefficients
Low Correlation with the Dependent Variable
Variable Selection Using the Cp Statistic
Results
Statistical Significance of Pair-wise Language Comparisons
         JavaScript  Perl  Tcl   Python  PHP   Java  C     C++   C#
Perl     0.46
Tcl      0.60        1.00
Python   0.00        0.00  0.76
PHP      0.00        0.00  0.08  0.72
Java     0.00        0.00  0.02  0.18    1.00
C        0.00        0.00  0.00  0.01    0.53  0.97
C++      0.00        0.00  0.00  0.00    0.01  0.07  0.59
C#       0.00        0.00  0.00  0.02    0.26  0.50  0.83  1.00
Pascal   0.00        0.00  0.00  0.00    0.10  0.26  0.60  0.99  1.00
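To read the matrix programmatically, the values can be stored as a lower triangle and queried by language pair. The sketch below does that and lists the pairs that fall under a significance level of 0.05. That level is an assumption, since the slides do not state the threshold used, and the names LOWER, p_value and significant_pairs are illustrative.

```python
LANGUAGES = ["JavaScript", "Perl", "Tcl", "Python", "PHP",
             "Java", "C", "C++", "C#", "Pascal"]

# Row k holds the comparisons of LANGUAGES[k + 1] against LANGUAGES[0..k],
# copied from the slide (values are rounded to two decimals there).
LOWER = [
    [0.46],
    [0.60, 1.00],
    [0.00, 0.00, 0.76],
    [0.00, 0.00, 0.08, 0.72],
    [0.00, 0.00, 0.02, 0.18, 1.00],
    [0.00, 0.00, 0.00, 0.01, 0.53, 0.97],
    [0.00, 0.00, 0.00, 0.00, 0.01, 0.07, 0.59],
    [0.00, 0.00, 0.00, 0.02, 0.26, 0.50, 0.83, 1.00],
    [0.00, 0.00, 0.00, 0.00, 0.10, 0.26, 0.60, 0.99, 1.00],
]

ALPHA = 0.05  # assumed significance level, not given on the slides


def p_value(a, b):
    """Look up the tabulated value for a pair of distinct languages."""
    i, j = LANGUAGES.index(a), LANGUAGES.index(b)
    if i == j:
        raise ValueError("a language is not compared with itself")
    if i < j:
        i, j = j, i
    return LOWER[i - 1][j]


def significant_pairs(alpha=ALPHA):
    """All language pairs whose tabulated value is below alpha."""
    return [(LANGUAGES[col], LANGUAGES[row + 1])
            for row, values in enumerate(LOWER)
            for col, p in enumerate(values)
            if p < alpha]


for first, second in significant_pairs():
    print(f"{first} vs {second}: {p_value(first, second):.2f}")
```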
Conclusions
Three groups with no statistically significant internal
differences but statistically significant external differences
An approximate progression from high-level, multi-paradigm languages to low-level
imperative/object-oriented languages
The assertion does appear to hold true for the languages
most similar to those Brooks was considering
Future Work
Additional studies of productivity
Relate lines of code to function points or
pattern points
Develop model to provide explanation not just
hypothesis testing
Consider additional explanatory factors
Studies to determine whether these
statistical significances translate into
practical significances
Questions?