Transcript Sourcererx
Presented by
Asheq Hamid
An Example
Find a code snippet where:
class “B” inherits class “A”
And class “A” has a method named “C”
And class “B” overrides that method “C”
class A{
class B extends A{
void C (){
// Body of method C
void C (){
// Body of the
//Overridden method
}
}
How to leverage this type of searching?
Google code search, Koders , Krugle code search ??
Textual search : Ignores rich structural information in the code.
Sourcerer provides an infrastructure upon which this type of
searching can be implemented.
Outline of the presentation
Sourcerer Infrastructure : How the repository has been created.
Sourcerer Web Services : What service Sourcerer developers are
providing on top of this infrastructure.
Application to Existing Tools : How different existing tools can be
benefited from Sourcerer repository.
Future Work/Extension : What can be some useful additions to the
project.
Key References
E. Linstead, S. Bajracharya, T. Ngo, P. Rigor, C. Lopes, P. Baldi. Sourcerer: Mining and
Searching Internet-Scale Software Repositories. Data Mining and Knowledge
Discovery 2008
S. Bajracharya, J. Ossher, and C. Lopes. Sourcerer – An Infrastructure for Large-scale
Collection and Analysis of Open-source Code. In Proceedings of the Third
International Workshop on Academic Software Development Tools and Techniques
(WASDeTT-3), 2010.
Sourcerer Project Website: http://sourcerer.ics.uci.edu/
Infrastructure
The Sourcerer infrastructure comprises five major subsystems:
1.
2.
3.
4.
5.
A system to crawl and manage software repositories.
A system to parse and extract features from the code.
A relational database to store the information.
Various tools to mine, search the database.
A Web-based graphical interface.
1. Crawl and manage software repositories
•
External code repository •
•
(on the web)
•
Current prototype supports java projects.
Sourceforge
Apache
Colt and Weka
Code crawler
• Crawlers for downloading from Sourceforge.
• Web spiders to get code from academic
repository and researcher’s web site
Local storage
• Keeps local copy of projects in local hard disk.
2. Parse and extract features from the code
A parser has been written on top of Eclipse’s AST parser.
Mainly the following information is extracted after parsing:
Entity
Relation
Keyword
Fingerprint
Several passes required on the source code to extract all these information.
Entity
PACKAGE
CLASS
INTERFACE
ENUM
ANNOTATION
INITIALIZER
FIELD
CONSTRUCTOR
METHOD
PARAMETER
LOCAL VARIABLE
ARRAY
Relation
Relation
Description
Example
Extend
Class inheritence
classA extends classB
Implement
Interface implementation
classB implements interfaceB
Returns
Method return value
Java.lang.String.toCharArray()
returns char[]
Calls
Method invocation
Void foo(){ bar();}
Inside
Physical containment
Java.lang.String inside java.lang
………
…………
………………….
Relation….
Keywords
Why keywords??
Because they are useful for faster retrieval of search result.
How keywords are extracted ?
Fully qualified names are broken according to java convention.
For example: “quickSort” is broken into two keywords : “quick” and
“sort” .
Fingerprints
What is fingerprint?
code with particular syntactical signatures .
Example:
Find a code snippet with three nested loops.
Find a switch statement with seven cases.
Fingerprints are useful for structural searches of source code.
Cross project dependency
What is cross project dependency?
Every project has some external dependencies. These dependencies are
typically packaged in jar files and included along with the source code.
Sourcerer keeps track of these dependency files.
In case of a missing dependency file, Sourcerer tries to locate that jar file base
upon missing dependency information.
3. Store information in the database.
Database
SourcererDB
ArtifactDB
Stores relational
information extracted from
source code
Stores information about jar
files for automated resolution
of missing dependency
4. Tools created
Tools
Code Crawler: Takes a set of root URLs as an input and produces a list of download URLs
and version control links along with other project specific metadata .
Repository Creator: Deletes duplicate links and downloads the source code.
Feature Extractor: Extracts detailed structural information from source code.
Database Importer: Imports extracts information in SourcererDB and ArtifactDB.
Code Indexer: Code indexer produces a semi-structured full text index.
5. Web-based graphical interface
Can be found at:
http://sourcerer.ics.uci.edu/sourcerer/search/index.jsp
Sourcerer Web services:
Code Search
Repository Access
Dependency Slicing
Similarity Calculation
Code Search:
Input: A combination of terms and fields. The query language is based on
Lucene’s implementation.
Output: A result set with detailed information on the entities that matched
the queries.
An example query
** Find a method with terms ”week” and ”date” in its short name,
that returns a ”String” type and
takes in argument with the term ”Date” in its name.
Corresponding query:
short name: (week date) AND
entity type: METHOD AND
m_ret_type_sname_contents: String AND
m_sig_args_fqn_contents: Date
Repository Access
Input: Id of (file | entity | relation | comment)
Output: The file that contains the id.
Dependency Slicing:
What is a dependency slice?
A dependency slice of an entity is a program (collection of Java source
files) which includes that entity as well as all the entities upon which it
depends. A dependency slice should be immediately compilable.
Dependency Slicing…
Input: One or more entity ids.
Output: A zip file containing the collection of sliced/synthesized Java
files that the given set of entities depend on.
Similarity Calculation
Input: An entity id.
Output: A list of other entities that are similar to the input entity.
*How the similarity has been calculated is out of the scope of the paper.
Application to Existing Tools
1. Finding better code snippet
Strathcona:
Strathcona is a tool that also uses structural information to find code
examples.
Its code repository structure is very similar to that of Sourcerer.
The large repository of Sourcerer can help Strathcona searching code in
a bigger repository.
1. Finding better code snippet…..
Parseweb:
Parseweb is a tool that provides example for object instantiation.
It downloads code from google code search for examples of object
instantiation.
There is no way to automatically resolve missing dependencies, it uses
some heuristics.
Sourcerer can benefit Parseweb a great deal as the external
dependencies are automatically resolved before downloading the
source code.
2. Information mining
SpotWeb and CodeWeb
Look for hotspots: From different API ,finds the entities which have
most associations to other entities.
Using Sourcerer, hotspots could be detected directly simply by ordering
the entities in a jar by the number of incoming relations.
3. Test driven code search
CodeGenie and Code Conjurer, both use the context provided by a test
case to formulate queries.
Code Conjurer’s dependency resolution can be empowered by
Sourcerer’s automatic dependency resolution ability.
Future Extensions
1. Multiple Language Support
Currently support only Java.
Add support for other languages.
2. Addressing Evolution.
Needs to create separate project for each new version of a project.
Add automated support for adopting new versions without creating a
new project each time .
Future Extensions….
3. Considering non-code artifacts.
Many code project contains non-code artifacts like data on issues,
bugs, documentation, authorship, developer’s activities/ history etc.
Find an approach to connect Sourcerer’s models and services with
these non-code artifacts.
4. Intergrating with other open source platforms.
There are some open-source quality monitoring platforms.
FLOSSmole project is a collaborative effort to collect and analyze large
amount of open source project data.
Its database contains more project specific metadata.
Sourcerer contains more structure specific data.
Therefore, integrating Sourcerer with FLOSSmole could widen the
scope and impact of both projects
Questions?