
Assieme: Finding and Leveraging
Implicit References in a Web Search
Interface for Programmers
Raphael Hoffmann, James Fogarty, Daniel S. Weld
University of Washington, Seattle
UIST 2007
Programmers Use Search
• To identify an API
• To seek information about an API
• To find examples on how to use an API
Example Task:
“Programmatically output an
Acrobat PDF file in Java.”
Example:
General Web Search Interface
Example:
Code-Specific Web Search Interface
Problems
• Information is dispersed: tutorials, API itself,
documentation, pages with samples
• Difficult and time-consuming to …
– locate required pieces,
– get an overview of alternatives,
– judge relevance and quality of results,
– understand dependencies.
• Many page visits required
With Assieme we …
• Designed a new Web search interface
• Developed needed inference
Outline
• Motivation
• What Programmers Search For
• The Assieme Search Engine
– Inferring Implicit References
– Using Implicit References for Scoring
• Evaluation of Inference & User Study
• Discussion & Conclusion
Six Learning Barriers
faced by Programmers (Ko et al. 04)
• Design barriers
— What to do?
• Selection barriers
— What to use?
• Coordination barriers
— How to combine?
• Use barriers
— How to use?
• Understanding barriers
— What is wrong?
• Information barriers
— How to check?
Examining Programmer Web Queries
Objective
• See what programmers search for
Dataset
• 15 million queries and click-through data
• Random sample of MSN queries in 05/06
Procedure
• Extract query sessions containing ‘java’ – 2,529
• Manually examined queries and defined regex filters
• Informal taxonomy of query sessions
Examining Programmer Web Queries
[Chart: taxonomy of query sessions]
• 64.1 % descriptive, e.g. “java JSP current date” (selection barrier)
• 35.9 % contain a package, type, or member name, e.g. “java SimpleDateFormat” (use barrier)
• 17.9 % contain terms like “example”, “using”, “sample code”, e.g. “using currentdate in jsp” (coordination barrier)
Assieme
[Screenshot of the Assieme interface, annotated:]
• relevance indicated by # uses
• documentation
• example code
• summaries show referenced types
• required libraries
• links to related info
Challenges
How do we put the right information on the interface?
• Get all programming-related data
• Interpret data and infer relationships
Outline
• Motivation
• What Programmers Search For
• The Assieme Search Engine
– Inferring Implicit References
– Using Implicit References for Scoring
• Evaluation of Inference & User Study
• Discussion & Conclusion
Assieme’s Data
… is crawled using existing search engines
• Pages with code examples (~2,360,000): queried Google on “java ±import ±class …”
• JAR files (~79,000): downloaded library files for all projects on Sun.com, Apache.org, Java.net, SourceForge.net
• JavaDoc pages (~480,000): queried Google on “overview-tree.html …”
The Assieme Search Engine
… infers 2 kinds of implicit references:
• Uses of packages, types, and members (from pages with code examples into JAR files)
• Matches of packages, types, and members (between JAR files and JavaDoc pages)
Extracting Code Samples
Difficulties on real Web pages:
• unclear segmentation
• code in a different language (C++)
• distracting terms ‘…’ in code
• line numbers
Extracting Code Samples
1. Remove HTML commands, but preserve line breaks
2. Remove some distracters by heuristics
3. Launch (error-tolerant) Java parser at every line break (separately parse for types, methods, and sequences of statements)
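As a rough illustration of steps 1 and 2 above, the HTML stripping and line-number removal can be sketched with a few regular expressions. Class and method names here are hypothetical, and the keyword check is only a crude stand-in for Assieme's error-tolerant parser:

```java
public class CodeExtractorSketch {

    // Step 1+2 sketch: drop HTML tags but keep <br> as line breaks,
    // then heuristically strip "1:"-style line-number prefixes.
    static String stripHtml(String html) {
        String text = html.replaceAll("(?i)<br\\s*/?>", "\n"); // preserve line breaks
        text = text.replaceAll("<[^>]+>", "");                 // remove remaining tags
        return text.replaceAll("(?m)^\\s*\\d+:\\s?", "");      // drop line numbers
    }

    // Crude stand-in for "launch parser at every line break":
    // does this line start like a Java declaration or statement?
    static boolean looksLikeJavaStart(String line) {
        return line.matches("\\s*(import|class|public|void|int|package)\\b.*");
    }

    public static void main(String[] args) {
        String html = "A simple example:<br><br>1: import java.util.*;<br>2: class c {";
        for (String line : stripHtml(html).split("\n")) {
            if (looksLikeJavaStart(line)) System.out.println(line);
        }
    }
}
```

A real extractor would hand each candidate line to an actual error-tolerant parser rather than keyword-matching, but the preprocessing shape is the same.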
Example page (HTML source):

<html>
<head><title></title></head>
<body>
A simple example:<br><br>
1: import java.util.*; <br>
2: class c {<br>
3: HashMap m = new HashMap();<br>
4: void f() { m.clear(); }<br>
5: }<br><br>
<a href=“index.html”>back</a>
</body>
</html>

Extracted code sample:

import java.util.*;
class c {
  HashMap m = new HashMap();
  void f() { m.clear(); }
}
Resolving External Code References
A naïve approach of finding term matches does not work:

1: import java.util.*;
2: class c {
3:   HashMap m = new HashMap();
4:   void f() { m.clear(); }
5: }

The reference java.util.HashMap.clear() on line 4 is only detectable by considering several lines.
→ Use compiler to identify unresolved names
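The compiler-based idea can be sketched with the standard javax.tools API: compile the sample entirely in memory and read the diagnostics, whose "cannot find symbol" errors name the external references to look up. This is a minimal sketch under that assumption, not Assieme's actual pipeline:

```java
import javax.tools.*;
import java.net.URI;
import java.util.*;

public class UnresolvedNameFinder {

    // Compile a code sample in memory and collect compiler diagnostics;
    // unresolved-symbol errors point at external references that can
    // then be looked up in an index of JAR contents.
    static List<String> diagnose(String className, String source) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        DiagnosticCollector<JavaFileObject> diags = new DiagnosticCollector<>();
        JavaFileObject file = new SimpleJavaFileObject(
                URI.create("string:///" + className + ".java"),
                JavaFileObject.Kind.SOURCE) {
            @Override
            public CharSequence getCharContent(boolean ignoreEncodingErrors) {
                return source;
            }
        };
        compiler.getTask(null, null, diags, null, null, List.of(file)).call();
        List<String> messages = new ArrayList<>();
        for (Diagnostic<? extends JavaFileObject> d : diags.getDiagnostics())
            messages.add(d.getMessage(Locale.ENGLISH));
        return messages;
    }

    public static void main(String[] args) {
        // Without an import, HashMap is an unresolved name the compiler reports.
        for (String m : diagnose("c", "class c { HashMap m = new HashMap(); }"))
            System.out.println(m);
    }
}
```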
Resolving External Code References
• Index packages/types/members in JAR files:
  java.util.HashMap.clear()
  java.util.HashMap
  …
• Put JAR files on the classpath, compile, and look up unresolved names in the index
• Greedily pick the best JARs; utility function: # covered references (and JAR popularity)
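The greedy selection above resembles greedy set cover: repeatedly pick the JAR that resolves the most still-uncovered references. A minimal sketch, with the "# covered references" utility only (JAR popularity, which the slide also mentions, is omitted; names are hypothetical):

```java
import java.util.*;

public class GreedyJarPicker {

    // Greedily pick JARs that cover the most still-unresolved references.
    static List<String> pickJars(Map<String, Set<String>> jarToRefs,
                                 Set<String> unresolved) {
        List<String> picked = new ArrayList<>();
        Set<String> remaining = new HashSet<>(unresolved);
        while (!remaining.isEmpty()) {
            String best = null;
            int bestCover = 0;
            for (Map.Entry<String, Set<String>> e : jarToRefs.entrySet()) {
                Set<String> covered = new HashSet<>(e.getValue());
                covered.retainAll(remaining);          // references this JAR resolves
                if (covered.size() > bestCover) {
                    bestCover = covered.size();
                    best = e.getKey();
                }
            }
            if (best == null) break;                   // nothing covers any further refs
            picked.add(best);
            remaining.removeAll(jarToRefs.get(best));
        }
        return picked;
    }
}
```

Greedy set cover is the natural choice here: finding the truly minimal set of JARs is NP-hard, while the greedy pick gives a good approximation in one pass over the index.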
Scoring
• Existing techniques …
– Docs modeled as weighted term frequencies
– Hypertext link analysis (PageRank)
• … do not work well for code, because:
– JAR files (binary code) provide no context
– Source code contains few relevant keywords
– Structure in code important for relevance
Using Implicit References
to Improve Scoring
• Assieme exploits structure on Web pages
and structure in code
HTML hyperlinks
code references
Scoring
APIs
(packages/types/members)
Web pages
Scoring
APIs
• Use text on doc pages and on pages with code
samples that reference API (~ anchor text)
• Weight APIs by #incoming refs (~ PageRank)
Web Pages
• Use fully qualified references (java.util.HashMap)
and adjust term weights
• Filter pages by references
• Favor pages with accompanying text
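The "# incoming refs" weighting above can be sketched as a simple reference count over crawled pages, a deliberately flattened analogue of PageRank (class and method names are hypothetical):

```java
import java.util.*;

public class ApiRefScorer {

    // Weight each API (package/type/member) by how many pages' code
    // references it: a simplified "# incoming refs" score.
    static Map<String, Integer> scoreApis(List<Set<String>> pageRefs) {
        Map<String, Integer> score = new HashMap<>();
        for (Set<String> refs : pageRefs)
            for (String api : refs)
                score.merge(api, 1, Integer::sum);     // one vote per referencing page
        return score;
    }

    public static void main(String[] args) {
        List<Set<String>> pages = List.of(
                Set.of("java.util.HashMap"),
                Set.of("java.util.HashMap", "java.io.File"));
        System.out.println(scoreApis(pages));
    }
}
```

Full PageRank would additionally propagate weight along the reference graph; the raw in-degree shown here is the first-order version of the same signal.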
Outline
• Motivation
• What Programmers Search For
• The Assieme Search Engine
– Inferring Implicit References
– Using Implicit References for Scoring
• Evaluation of Inference & User Study
• Discussion & Conclusion
Evaluating Code Extraction and
Reference Resolution
… on 350 hand-labeled pages from Assieme’s data
Code Extraction
• Recall 96.9%, Precision 50.1%
• False positives: C, C#, JavaScript, PHP, FishEye/diff
• After filtering pages without references: Precision 76.7%
Reference Resolution
• Recall 89.6%, Precision 86.5%
• False positives: Fisheye and diff pages
• False negatives: incomplete code samples
User Study
Assieme vs. Google vs. Google Code Search
Design
• 40 search tasks based on queries in logs:
query “socket java” → “Write a basic server that
communicates using Sockets”
• Find code samples (and required libraries)
• 4 blocks of 10 tasks: 1 for training + 1 per interface
Participants
• 9 (under-)graduate students in Computer Science
User Study – Task Time
[Bar chart: mean task time in seconds (± SEM), y-axis 0–150, for Assieme, Google, GCS]
• F(1,258)=5.74, p ≈ .017 (significant)
• F(1,258)=1.91, p ≈ .17 (not significant)
User Study – Solution Quality
Rating scale: 0 = seriously flawed, .5 = generally good but fell short in a critical regard, 1 = fairly complete
[Bar chart: mean solution quality (± SEM), y-axis 0.0–1.0, for Assieme, Google, GCS]
• F(1,258)=55.5, p < .0001 (significant)
• F(1,258)=6.29, p ≈ .013 (significant)
User Study – # Queries Issued
[Bar chart: mean # queries issued (± SEM), y-axis 0.0–2.5, for Assieme, Google, GCS]
• F(1,259)=6.85, p ≈ .001 (significant)
• F(1,259)=9.77, p ≈ .002 (significant)
Outline
• Motivation
• What Programmers Search For
• The Assieme Search Engine
– Inferring Implicit References
– Using Implicit References for Scoring
• Evaluation of Inference & User Study
• Discussion & Conclusion
Discussion & Conclusion
• Assieme – a novel web search interface
• Programmers obtain better solutions, using
fewer queries, in the same amount of time
• Using Google, subjects visited 3.3 pages per task; using Assieme, only 0.27 pages but 4.3 previews
• Ability to quickly view code samples changed
participants’ strategies
Thank You
Raphael Hoffmann
Computer Science & Engineering
University of Washington
[email protected]
James Fogarty
Computer Science & Engineering
University of Washington
[email protected]
Daniel S. Weld
Computer Science & Engineering
University of Washington
[email protected]
This material is based upon work supported by the National Science Foundation under grant IIS-0307906, by the Office of Naval Research under grant N00014-06-10147, by SRI International under CALO grant 03-000225, and by the Washington Research Foundation / TJ Cable Professorship.