Assieme: Finding and Leveraging Implicit References in a Web Search Interface for Programmers Raphael Hoffmann, James Fogarty, Daniel S. Weld University of Washington, Seattle UW CSE Industrial Affiliates Meeting 2007
Programmers Use Search: to identify an API, to seek information about an API, to find examples of how to use an API. Example Task: programmatically output an Acrobat PDF file in Java.
Example: General Web Search Interface
Example: Code-Specific Web Search Interface
Problems Information is dispersed: tutorials, API itself, documentation, pages with samples Difficult and time-consuming to locate required pieces, get an overview of alternatives, judge relevance and quality of results, understand dependencies. Many page visits required
With Assieme we designed a new Web search interface and developed the needed inference
Outline Motivation What Programmers Search For The Assieme Search Engine Inferring Implicit References Using Implicit References for Scoring Evaluation of Inference & User Study Discussion & Conclusion
Six Learning Barriers faced by Programmers (Ko et al. 04): Design barriers (What to do?), Selection barriers (What to use?), Coordination barriers (How to combine?), Use barriers (How to use?), Understanding barriers (What is wrong?), Information barriers (How to …?)
Examining Programmer Web Queries. Objective: see what programmers search for. Dataset: 15 million queries and click-through data, a random sample of MSN queries in 05/06. Procedure: extract query sessions containing java (2,529); manually examine queries and define regex filters; build an informal taxonomy of query sessions
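A minimal sketch of how such regex filters might classify query sessions. The patterns and category names below are illustrative assumptions, not the study's actual filters; they only mirror the kinds of cues described on the next slides (API-like names, terms like "example" or "using").

```java
import java.util.regex.Pattern;

// Illustrative query classifier; the regexes are assumptions, not the
// filters actually used in the study.
public class QueryClassifier {
    // queries containing a qualified name or a CamelCase type name
    static final Pattern API_NAME =
        Pattern.compile("\\b(java\\.[\\w.]+|[A-Z][a-z]+[A-Z]\\w*)\\b");
    // queries asking how to use something
    static final Pattern USE_TERMS =
        Pattern.compile("(?i)\\b(example|using|sample|tutorial)\\b");

    static String classify(String query) {
        if (USE_TERMS.matcher(query).find()) return "use";
        if (API_NAME.matcher(query).find()) return "selection";
        return "descriptive";
    }

    public static void main(String[] args) {
        System.out.println(classify("java SimpleDateFormat"));       // selection
        System.out.println(classify("using currentdate code in jsp")); // use
        System.out.println(classify("java JSP current date"));       // descriptive
    }
}
```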
Examining Programmer Web Queries
Examining Programmer Web Queries: top-level split 64.1% / 35.9%. Descriptive queries, e.g. java JSP current date (selection barrier); 17.9% contain a package, type, or member name, e.g. java SimpleDateFormat (use barrier); queries containing terms like example, using, sample, e.g. using currentdate code in jsp (coordination barrier)
Assieme
Assieme packages/types/members
Assieme relevance indicated by # uses
Assieme documentation
Assieme Filter search results
Assieme Summaries show referenced types
Assieme required libraries
Assieme example code
Assieme more info on types in sample code
Challenges: How to put the right information on the interface? Get all programming-related data. Interpret data and infer relationships
Outline Motivation What Programmers Search For The Assieme Search Engine Inferring Implicit References Using Implicit References for Scoring Evaluation of Inference & User Study Discussion & Conclusion
Assieme's Data is crawled using existing search engines: pages with code examples (~2,360,000), found by querying Google on java +import +class; JAR files (~79,000), downloaded library files for all projects on Sun.com, Apache.org, Java.net, SourceForge.net; JavaDoc pages (~480,000), found by querying Google on overview-tree.html
The Assieme Search Engine infers 2 kinds of implicit references: uses of packages, types, and members (from pages with code examples to JAR files), and matches of packages, types, and members (from JavaDoc pages to JAR files)
Extracting Code Samples: challenges include unclear segmentation, code in a different language (e.g. C++), and distracting terms in code such as line numbers
Extracting Code Samples: remove HTML commands but preserve line breaks; remove some distracters by heuristics; launch an (error-tolerant) Java parser at every line break (separately parse for types, methods, and sequences). Example page source:
<html> <head><title></title></head> <body> A simple example:<br><br> 1: import java.util.*;<br> 2: class c {<br> 3: HashMap m = new HashMap();<br> 4: void f() { m.clear(); }<br> 5: }<br><br> <a href="index.html">back</a>
Extracted code:
import java.util.*;
class c {
  HashMap m = new HashMap();
  void f() { m.clear(); }
}
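Two of the steps above (stripping HTML while preserving line breaks, and removing line-number distracters) could be sketched as follows. The regular expressions are illustrative assumptions, and the error-tolerant parsing step is omitted entirely.

```java
// Sketch of two extraction heuristics from the slide; the regexes are
// assumptions, and Assieme's real pipeline additionally runs an
// error-tolerant Java parser at every line break.
public class CodeExtractor {
    static String stripHtml(String html) {
        return html
            .replaceAll("(?i)<br\\s*/?>", "\n")   // keep line structure
            .replaceAll("<[^>]+>", "")            // drop remaining tags
            .replaceAll("&lt;", "<")
            .replaceAll("&gt;", ">")
            .replaceAll("&amp;", "&");
    }

    static String dropLineNumbers(String text) {
        // remove prefixes like "3: " that tutorials add to code listings
        return text.replaceAll("(?m)^\\s*\\d+:\\s?", "");
    }

    public static void main(String[] args) {
        String page = "1: import java.util.*;<br>2: class c {<br>3: }";
        System.out.println(dropLineNumbers(stripHtml(page)));
    }
}
```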
Resolving External Code References: a naïve approach of finding term matches does not work:
1 import java.util.*;
2 class c {
3   HashMap m = new HashMap();
4   void f() { m.clear(); }
5 }
The reference java.util.HashMap.clear() on line 4 is only detectable by considering several lines together, so use a compiler to identify unresolved names
Resolving External Code References: index packages/types/members in JAR files; compile and look up unresolved names (e.g. java.util.HashMap.clear(), java.util.HashMap) in the index of JAR files; utility function: # covered references (and JAR popularity); greedily pick the best JARs and put them on the classpath
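The greedy JAR-selection step could be sketched as a set-cover heuristic: repeatedly pick the JAR that covers the most still-unresolved names. This simplified sketch counts only newly covered references and ignores the JAR-popularity term mentioned on the slide; the names and index shape are assumptions.

```java
import java.util.*;

// Greedy set-cover sketch of JAR selection; a simplification that
// ignores the JAR-popularity part of the utility function.
public class JarSelector {
    static List<String> pickJars(Set<String> unresolved,
                                 Map<String, Set<String>> jarIndex) {
        List<String> chosen = new ArrayList<>();
        Set<String> remaining = new HashSet<>(unresolved);
        while (!remaining.isEmpty()) {
            String best = null;
            int bestCover = 0;
            for (Map.Entry<String, Set<String>> e : jarIndex.entrySet()) {
                Set<String> cover = new HashSet<>(e.getValue());
                cover.retainAll(remaining);        // names this JAR resolves
                if (cover.size() > bestCover) {
                    best = e.getKey();
                    bestCover = cover.size();
                }
            }
            if (best == null) break;               // nothing else resolvable
            chosen.add(best);
            remaining.removeAll(jarIndex.get(best));
        }
        return chosen;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index = new HashMap<>();
        index.put("util.jar", Set.of("java.util.HashMap", "java.util.List"));
        index.put("extra.jar", Set.of("org.example.Foo"));
        System.out.println(pickJars(new HashSet<>(
            Set.of("java.util.HashMap", "java.util.List", "org.example.Foo")),
            index));
    }
}
```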
Scoring: existing techniques (documents modeled as weighted term frequencies; hypertext link analysis, e.g. PageRank) do not work well for code, because JAR files (binary code) provide no context, source code contains few relevant keywords, and structure in code is important for relevance
Using Implicit References to Improve Scoring: Assieme exploits structure on Web pages (HTML hyperlinks) and structure in code (code references)
Scoring APIs: use text on doc pages and on pages with code samples that reference the API (analogous to anchor text); weight APIs by # incoming references (analogous to PageRank). Scoring Web pages: use fully qualified references (java.util.HashMap) and adjust term weights; filter pages by references; favor pages with accompanying text
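The "weight APIs by # incoming references" idea could be sketched as below. The log-damped combination is an assumption for illustration, not Assieme's actual formula; it only shows how reference counts can boost a term-match score with diminishing returns.

```java
// Illustrative API scoring; the formula is an assumption, not Assieme's.
public class ApiScorer {
    static double score(double termMatchWeight, int incomingRefs) {
        // more pages referencing an API -> higher score, damped by log
        return termMatchWeight * (1.0 + Math.log1p(incomingRefs));
    }

    public static void main(String[] args) {
        // same term match, rarely used API vs. a popular one
        System.out.println(score(0.8, 2));
        System.out.println(score(0.8, 500));
    }
}
```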
Outline Motivation What Programmers Search For The Assieme Search Engine Inferring Implicit References Using Implicit References for Scoring Evaluation of Inference & User Study Discussion & Conclusion
Evaluating Code Extraction and Reference Resolution on 350 hand-labeled pages from Assieme's data. Code extraction: recall 96.9%, precision 50.1% (after filtering pages without references: precision 76.7%); false positives: C, C#, JavaScript, PHP, FishEye/diff pages. Reference resolution: recall 89.6%, precision 86.5%; false positives: FishEye and diff pages; false negatives: incomplete code samples
User Study: Assieme vs. Google vs. Google Code Search. Design: 40 search tasks based on queries in the logs, e.g. for the query socket java: write a basic server that communicates using Sockets; find code samples (and required libraries); 4 blocks of 10 tasks: 1 for training + 1 per interface. Participants: 9 undergraduate and graduate students in Computer Science
User Study: Solution Quality. Rating scale: 0 = seriously flawed, .5 = generally good but fell short in a critical regard, 1 = fairly complete. Significant differences (*): F(1,258)=55.5, p<.0001 and F(1,258)=6.29, p=.013. [Chart: quality (± SEM) for Assieme, Google, GCS]
User Study: # Queries Issued. Significant differences (*): F(1,259)=9.77, p=.002 and F(1,259)=6.85, p=.001. [Chart: # queries (± SEM) for Assieme, Google, GCS]
User Study: Task Time. F(1,258)=5.74, p=.017 (* significant); F(1,258)=1.91, p=.17 (not significant). [Chart: seconds (± SEM) for Assieme, Google, GCS]
Outline Motivation What Programmers Search For The Assieme Search Engine Inferring Implicit References Using Implicit References for Scoring Evaluation of Inference & User Study Discussion & Conclusion
Discussion & Conclusion: Assieme is a novel Web search interface. Programmers obtain better solutions, using fewer queries, in the same amount of time. Using Google, subjects visited 3.3 pages/task; using Assieme, only 0.27 pages, but 4.3 previews. The ability to quickly view code samples changed participants' strategies
Thank You Raphael Hoffmann Computer Science & Engineering University of Washington raphaelh@cs.washington.edu James Fogarty Computer Science & Engineering University of Washington jfogarty@cs.washington.edu Daniel S. Weld Computer Science & Engineering University of Washington weld@cs.washington.edu This material is based upon work supported by the National Science Foundation under grant IIS-0307906, by the Office of Naval Research under grant N00014-06-1-0147, SRI International under CALO grant 03-000225 and the Washington Research Foundation / TJ Cable Professorship.