BOA: A LANGUAGE AND INFRASTRUCTURE
FOR ANALYZING ULTRA-LARGE-SCALE
SOFTWARE REPOSITORIES
Robert Dyer, Hoan Anh Nguyen, Hridesh Rajan & Tien N. Nguyen
Iowa State University, USA
Presenter: Joshan V John
Instructor: Christoph Csallner
Joshan Valayil John | The University of Texas at Arlington | CSE 6324
Agenda
• Motivation
• Ultra-large-scale software repositories
• Barriers to mining software repositories
• Solution - Boa
• Goals of Boa
• Boa Architecture
• Evaluation
Motivation
• The Big-3 software repositories are known to host close to 1 million projects.
• They contain a wealth of software and information about software.
• Systematically extracting relevant data from these repositories and analyzing it to test hypotheses is hard.
• Boa, a domain-specific language and infrastructure, was developed to ease the testing of hypotheses related to mining software repositories.
Ultra-large-scale Software Repositories
Why analyze software repositories?
• Curiosity
• Identify patterns
• Forecasting
• Plan for better designs
• Empirical validation
Barriers to mining software repositories
• Develop the programming expertise needed to access version control systems.
• Establish infrastructure to store the data downloaded from software repositories.
• Develop an application to access this local data.
• Improve the scalability of the analysis infrastructure to process ultra-large-scale data.
Barriers to mining software repositories
• Experiments are often irreproducible.
• Experimental infrastructure has low reusability.
• Lack of systematic curation leads to loss of experimental data.
• Building analysis infrastructure that processes ultra-large-scale data efficiently can be very hard.
Solution - Boa
• Boa: a domain-specific language and infrastructure designed to analyze ultra-large-scale software repositories.
Goals of Boa
• Easy to use
• Better abstractions
• Efficient & scalable
• Enhances reproducibility
A Research Question
• Consider a program that answers: “What are the churn rates for all Java projects that use SVN?”
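To make the question concrete, here is a minimal Python sketch of the computation on a toy in-memory data set. It is not Boa code and not the Java program from the next slide. It assumes "churn rate" means the average number of files changed per revision (the slide does not define the term), and the record layout and field names are hypothetical.

```python
from statistics import mean

# Hypothetical project records; the field names are illustrative only.
projects = [
    {"id": "p1", "languages": ["Java"], "repo_type": "SVN",
     "revisions": [["A.java", "B.java"], ["A.java"], ["C.java", "D.java", "E.java"]]},
    {"id": "p2", "languages": ["Python"], "repo_type": "SVN",
     "revisions": [["x.py"]]},
    {"id": "p3", "languages": ["java", "C"], "repo_type": "Git",
     "revisions": [["Y.java"]]},
]


def churn_rates(projects):
    """Return {project id: mean files changed per revision} for Java/SVN projects."""
    rates = {}
    for p in projects:
        uses_java = any(lang.lower() == "java" for lang in p["languages"])
        # Assumed definition of churn: files touched per revision, averaged.
        if uses_java and p["repo_type"] == "SVN" and p["revisions"]:
            rates[p["id"]] = mean(len(files) for files in p["revisions"])
    return rates


print(churn_rates(projects))
```

Only p1 is both a Java project and hosted in SVN, so it is the only project that appears in the result. A real analysis would also have to fetch and parse the repository data, which is where most of the effort described on the following slides goes.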
Solution in Java
• Full program over 70 lines of code.
• Uses JSON and SVN libraries.
• Runs sequentially.
• Takes over 24 hours.
• Takes almost 3 hours with data locally cached.
• Can be parallelized, but very complex.
Solution in Boa
• Simple program, 6 lines of code.
• Hides implementation specifics.
• Auto parallelization, results in 1 minute.
• Results can be easily reproduced by publishing these small programs with the data sets used.
Performance Results
Boa Architecture
Boa Architecture
• Three main components:
  ◦ The Boa language
  ◦ Boa compiler & runtime
  ◦ Supporting data infrastructure
The Boa Language
• Domain-specific types
• MapReduce support
• Quantifiers
• User-defined functions
• Output aggregators
Boa Language – Domain-Specific Types
• Provides several domain-specific types that abstract away the details of mining software repositories (http://boa.cs.iastate.edu/docs/dsl-types.php)
Boa Language – MapReduce Support
• Computations are specified via two user-defined functions:
  ◦ Mapper: takes key-value pairs as input and produces key-value pairs as output.
  ◦ Reducer: consumes the mapper's output and aggregates the data by key.
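The mapper/reducer split can be illustrated with a small single-process Python sketch. This is a generic illustration of the MapReduce model, not Boa's or Hadoop's implementation. The `run_mapreduce` driver and the sample records are hypothetical.

```python
from collections import defaultdict


def mapper(project_id, languages):
    # Emit one (language, 1) pair for each language the project uses.
    for lang in languages:
        yield lang, 1


def reducer(key, values):
    # Aggregate all values that were emitted under the same key.
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # "shuffle": group by output key
    return dict(reducer(k, vs) for k, vs in groups.items())


records = [("p1", ["Java", "C"]), ("p2", ["Java"]), ("p3", ["Python"])]
counts = run_mapreduce(records, mapper, reducer)
print(counts)
```

Because each mapper call looks at one input record and each reducer call at one key, a framework can distribute both phases across machines without changing the user's functions.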
Boa Language – Quantifiers
• Boa defines the quantifiers:
  ◦ exists
  ◦ foreach
  ◦ ifall
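As a rough intuition, the three quantifiers resemble familiar Python idioms. The sketch below is an analogy only. It assumes that `exists` fires when some element satisfies a condition, `foreach` runs for every element that does, and `ifall` fires only when all elements do. The file list and the `is_java` helper are hypothetical.

```python
files = ["Main.java", "README", "Util.java"]


def is_java(name):
    # Hypothetical condition used by all three quantifier analogues.
    return name.endswith(".java")


# exists: true if at least one element satisfies the condition
has_java = any(is_java(f) for f in files)

# foreach: visit every element that satisfies the condition
java_files = [f for f in files if is_java(f)]

# ifall: true only if every element satisfies the condition
all_java = all(is_java(f) for f in files)

print(has_java, java_files, all_java)
```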
Boa Language – User-Defined Functions
• Users can define their own mining algorithms.
• Facilitates code reuse.
Boa Language – Output aggregators
• Output can be indexed.
• Output is defined in terms of predefined data aggregators.
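One way to picture an indexed output with a predefined aggregator is a table keyed by index that folds every emitted value into a running aggregate. The Python class below is a toy stand-in for that idea, assuming a "mean" aggregator indexed by project id. The class name and its methods are hypothetical and are not part of Boa.

```python
from collections import defaultdict


class MeanOutput:
    """Toy indexed 'mean' aggregator (hypothetical, for illustration only)."""

    def __init__(self):
        self._sum = defaultdict(float)
        self._count = defaultdict(int)

    def emit(self, index, value):
        # Fold the value into the running aggregate stored under this index.
        self._sum[index] += value
        self._count[index] += 1

    def results(self):
        return {k: self._sum[k] / self._count[k] for k in self._sum}


rates = MeanOutput()
for project_id, files_changed in [("p1", 2), ("p1", 4), ("p2", 3)]:
    rates.emit(project_id, files_changed)

print(rates.results())
```

The program only emits values. How they are combined is fixed by the aggregator chosen when the output is declared, which is what lets the runtime perform the aggregation in the reduce phase.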
Boa’s Supporting Infrastructure
• Compiler & runtime
• Data infrastructure
• Web-based interface
Boa’s Compiler & Runtime
• Initial implementation was based upon the Sizzle compiler & framework.
• Sizzle is an open-source Java implementation of the Sawzall language.
• Sizzle provides support for generating programs that run on the Hadoop open-source MapReduce framework.
Boa’s Data Infrastructure
• Local cache of repository information.
• First step: locally replicate the data.
• Second step: run the caching translator to convert the data into the format required by the framework.
• Input (JSON file + SVN repositories) -> Output (Hadoop SequenceFile)
Boa’s Web-based Interface
• Submit programs.
• Programs are compiled and run on the Boa team's clusters.
• Each submission creates a job in the system.
Evaluation
• Programs were executed on a Hadoop 1.0.3 installation.
• The cluster was not tuned for performance, except for setting the maximum number of map tasks on each compute node equal to the number of cores on that node and increasing the VM heap size.
Evaluation – Applicability
• Research Question 1: Does Boa help researchers analyze ultra-large-scale software repositories?
• A set of 21 tasks in four different categories was examined:
  ◦ Programming Languages
  ◦ Project Management
  ◦ Legal
  ◦ Platform/Environment
Evaluation - Applicability
Evaluation - Scalability
• Research Question 2: Does the approach scale to the size of the cluster?
• Research Question 3: Does the approach scale with the size of the input?
Evaluation - Scalability
Evaluation - Scalability
Evaluation - Reproducibility
• Research Question 4: Using their infrastructure, can researchers easily reproduce previously published results?
Evaluation - Reproducibility
• Conducted a controlled experiment.
• Selected a group of 8 researchers.
• Each chose 3 tasks.
References
• http://design.cs.iastate.edu/papers/ICSE-13/icse13.pdf
• http://boa.cs.iastate.edu/docs/