Assieme: Finding and Leveraging Implicit References in a Web Search Interface for Programmers


Assieme: Finding and Leveraging Implicit References in a Web Search Interface for Programmers
Raphael Hoffmann, James Fogarty, Daniel S. Weld
University of Washington, Seattle
UW CSE Industrial Affiliates Meeting 2007

Programmers Use Search
- To identify an API
- To seek information about an API
- To find examples of how to use an API
Example task: Programmatically output an Acrobat PDF file in Java.

Example: General Web Search Interface

Example: Code-Specific Web Search Interface

Problems
- Information is dispersed: tutorials, the API itself, documentation, pages with samples
- It is difficult and time-consuming to locate required pieces, get an overview of alternatives, judge relevance and quality of results, and understand dependencies
- Many page visits are required

With Assieme we
- Designed a new Web search interface
- Developed the needed inference

Outline
- Motivation
- What Programmers Search For
- The Assieme Search Engine
  - Inferring Implicit References
  - Using Implicit References for Scoring
- Evaluation of Inference & User Study
- Discussion & Conclusion

Six Learning Barriers faced by Programmers (Ko et al. 04)
- Design barriers: What to do?
- Selection barriers: What to use?
- Coordination barriers: How to combine?
- Use barriers: How to use?
- Understanding barriers: What is wrong?
- Information barriers: How to

Examining Programmer Web Queries
Objective
- See what programmers search for
Dataset
- 15 million queries and click-through data
- Random sample of MSN queries in 05/06
Procedure
- Extract query sessions containing "java": 2,529
- Manually look at queries and define regex filters
- Informal taxonomy of query sessions

Examining Programmer Web Queries

Examining Programmer Web Queries
- 64.1 %: Descriptive, e.g. "java JSP current date" (selection barrier)
- 35.9 %: Contain a package, type or member name, e.g. "java SimpleDateFormat" (use barrier)
- 17.9 %: Contain terms like "example", "using", "sample", e.g. "using currentdate code in jsp" (coordination barrier)
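The regex filters themselves are not shown on the slides. As a rough illustration of how such filters could separate the three query kinds above, here is a minimal sketch. The class name, the method names and both patterns are assumptions made for this example and are not the filters used in the study.

```java
import java.util.regex.Pattern;

public class QueryTaxonomySketch {
    // Assumed heuristic: a CamelCase token (SimpleDateFormat) or a dotted
    // path (java.util.HashMap) hints at a package, type or member name.
    private static final Pattern API_NAME = Pattern.compile(
        "\\b(?:[A-Z][a-z0-9]+){2,}\\b|\\b[a-z]+(?:\\.[a-zA-Z]+)+\\b");

    // Terms the slide lists as typical for queries that look for examples.
    private static final Pattern EXAMPLE_TERM = Pattern.compile(
        "\\b(?:examples?|using|samples?)\\b", Pattern.CASE_INSENSITIVE);

    // Session filter: the query mentions "java" as a whole word.
    private static final Pattern JAVA_WORD =
        Pattern.compile("\\bjava\\b", Pattern.CASE_INSENSITIVE);

    static boolean isJavaQuery(String query) {
        return JAVA_WORD.matcher(query).find();
    }

    static boolean containsApiName(String query) {
        return API_NAME.matcher(query).find();
    }

    static boolean containsExampleTerm(String query) {
        return EXAMPLE_TERM.matcher(query).find();
    }

    // In this sketch a query counts as descriptive when no API name is found.
    static boolean isDescriptive(String query) {
        return !containsApiName(query);
    }

    public static void main(String[] args) {
        String[] queries = {
            "java JSP current date",
            "java SimpleDateFormat",
            "using currentdate code in jsp"
        };
        for (String q : queries) {
            System.out.println(q
                + " | descriptive=" + isDescriptive(q)
                + " | apiName=" + containsApiName(q)
                + " | exampleTerm=" + containsExampleTerm(q));
        }
    }
}
```

The example-term check is independent of the other two, which matches the slide: the first two percentages add up to 100 %, while the third overlaps with them.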

Assieme

Assieme packages/types/members

Assieme relevance indicated by # uses

Assieme documentation

Assieme Filter search results

Assieme Summaries show referenced types

Assieme required libraries

Assieme example code

Assieme more info on types in sample code

Challenges
How to put the right information on the interface?
- Get all programming-related data
- Interpret data and infer relationships

Assieme's Data is crawled using existing search engines
- Pages with code examples (~2,360,000): queried Google on java ±import ±class
- JAR files (~79,000): downloaded library files for all projects on Sun.com, Apache.org, Java.net, SourceForge.net
- JavaDoc pages (~480,000): queried Google on overview-tree.html

The Assieme Search Engine infers 2 kinds of implicit references
- Uses of packages, types and members: between pages with code examples and JAR files
- Matches of packages, types and members: between JAR files and JavaDoc pages

Extracting Code Samples
Difficulties:
- unclear segmentation
- code in a different language (C++)
- distracting terms
- line numbers in code

Extracting Code Samples
- Remove HTML commands, but preserve line breaks
- Remove some distracters by heuristics
- Launch an (error-tolerant) Java parser at every line break (separately parse for types, methods, and sequences)

HTML source:
    <html> <head><title></title></head> <body>
    A simple example:<br><br>
    1: import java.util.*;<br>2: class c {<br>3: HashMap m = new HashMap();<br>4: void f() { m.clear(); }<br>5: }<br><br>
    <a href=index.html>back</a>

Text after removing HTML commands:
    A simple example:
    1: import java.util.*;
    2: class c {
    3: HashMap m = new HashMap();
    4: void f() { m.clear(); }
    5: }
    back

Extracted code:
    import java.util.*;
    class c {
    HashMap m = new HashMap();
    void f() { m.clear(); }
    }
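The first two extraction steps can be sketched with plain regular expressions. This is only an illustration under stated assumptions: the class and method names are invented here, the line-number rule is one possible heuristic, and the error-tolerant parser that the slide launches at every line break is left out.

```java
public class CodeTextSketch {
    // Step 1: turn <br> into a newline so line structure survives,
    // drop every other tag, then decode the most common entities.
    static String stripHtml(String html) {
        String s = html.replaceAll("(?i)<br\\s*/?>", "\n");
        s = s.replaceAll("<[^>]*>", "");
        return s.replace("&lt;", "<").replace("&gt;", ">").replace("&amp;", "&");
    }

    // Step 2 (assumed heuristic): remove a leading "12:" or "12 " style
    // line number from each line. This may also strip a real code line
    // that happens to start with a number.
    static String dropLineNumbers(String text) {
        return text.replaceAll("(?m)^[ \\t]*\\d+:?[ \\t]+", "");
    }

    public static void main(String[] args) {
        String html = "<body>A simple example:<br><br>"
            + "1: import java.util.*;<br>2: class c {<br>"
            + "3: HashMap m = new HashMap();<br>"
            + "4: void f() { m.clear(); }<br>5: }<br></body>";
        String text = stripHtml(html);
        System.out.println(dropLineNumbers(text));
    }
}
```

The caption line "A simple example:" survives both steps. Telling such text apart from code is what the parsing step is for.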

Resolving External Code References
Naïve approach of finding term matches does not work:
    1 import java.util.*;
    2 class c {
    3   HashMap m = new HashMap();
    4   void f() { m.clear(); }
    5 }
- The reference java.util.HashMap.clear() on line 4 is only detectable by considering several lines
- Solution: use a compiler to identify unresolved names

Resolving External Code References
- Index packages/types/members in JAR files
- Compile & lookup: compile the sample, look up the unresolved names (e.g. java.util.HashMap, java.util.HashMap.clear()) in the index, greedily pick the best JARs, and put them on the classpath
- Utility function: # covered references (and JAR popularity)
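The greedy choice of JARs resembles a greedy set-cover step: repeatedly take the JAR that covers the most still-unresolved names. The sketch below assumes a hypothetical in-memory index and uses popularity only as a tie-breaker. The slide does not say how coverage and popularity are combined, so that combination is a guess, and all class, JAR and package names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class GreedyJarPickSketch {
    // Hypothetical index entry: the names a JAR provides and a popularity score.
    public static final class Jar {
        final String name;
        final Set<String> provides;
        final int popularity;

        public Jar(String name, Set<String> provides, int popularity) {
            this.name = name;
            this.provides = provides;
            this.popularity = popularity;
        }
    }

    // Greedily pick JARs until no JAR covers any remaining unresolved name.
    static List<String> pick(Set<String> unresolved, List<Jar> index) {
        Set<String> remaining = new HashSet<>(unresolved);
        List<String> classpath = new ArrayList<>();
        while (!remaining.isEmpty()) {
            Jar best = null;
            int bestCovered = 0;
            for (Jar jar : index) {
                int covered = 0;
                for (String ref : jar.provides) {
                    if (remaining.contains(ref)) covered++;
                }
                // Utility: number of covered references; popularity breaks ties.
                boolean better = covered > bestCovered
                    || (covered > 0 && covered == bestCovered
                        && jar.popularity > best.popularity);
                if (better) {
                    best = jar;
                    bestCovered = covered;
                }
            }
            if (best == null) break; // the rest cannot be resolved from the index
            classpath.add(best.name);
            remaining.removeAll(best.provides);
        }
        return classpath;
    }

    public static void main(String[] args) {
        List<Jar> index = Arrays.asList(
            new Jar("pdf-lib.jar", new HashSet<>(Arrays.asList(
                "org.example.pdf.Document", "org.example.pdf.PdfWriter")), 5),
            new Jar("pdf-lite.jar", new HashSet<>(Arrays.asList(
                "org.example.pdf.Document")), 100),
            new Jar("xml-lib.jar", new HashSet<>(Arrays.asList(
                "org.example.xml.Parser")), 1));
        Set<String> unresolved = new HashSet<>(Arrays.asList(
            "org.example.pdf.Document", "org.example.pdf.PdfWriter",
            "org.example.xml.Parser"));
        System.out.println(pick(unresolved, index));
    }
}
```

In a full pipeline the picked JARs would be put on the classpath and the sample compiled again, since newly resolved types can expose further unresolved members.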

Scoring
Existing techniques
- Docs modeled as weighted term frequencies
- Hypertext link analysis (PageRank)
do not work well for code, because:
- JAR files (binary code) provide no context
- Source code contains few relevant keywords
- Structure in code is important for relevance

Using Implicit References to Improve Scoring
Assieme exploits
- structure on Web pages: HTML hyperlinks
- structure in code: code references

Scoring
APIs
- Use text on doc pages and on pages with code samples that reference the API (~ anchor text)
- Weight APIs by # incoming refs (~ PageRank)
Web pages
- Use fully qualified references (java.util.HashMap) and adjust term weights
- Filter pages by references
- Favor pages with accompanying text
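Two of these ideas, counting incoming references per API and filtering pages by a reference, can be illustrated on a toy map from pages to their resolved references. The class name, method names and page names are hypothetical. Using the raw count as the weight is an assumption, since the slide only says that the number of incoming references is used, by analogy with PageRank.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class ReferenceScoringSketch {
    // For each fully qualified API name, count the pages that reference it.
    static Map<String, Integer> incomingRefs(Map<String, Set<String>> pageRefs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Set<String> refs : pageRefs.values()) {
            for (String api : refs) {
                counts.merge(api, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Keep only the pages whose resolved references include the given API.
    static List<String> pagesReferencing(String api, Map<String, Set<String>> pageRefs) {
        List<String> pages = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : pageRefs.entrySet()) {
            if (entry.getValue().contains(api)) pages.add(entry.getKey());
        }
        Collections.sort(pages);
        return pages;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> pageRefs = new HashMap<>();
        pageRefs.put("page1.html", new HashSet<>(Arrays.asList(
            "java.util.HashMap", "java.util.List")));
        pageRefs.put("page2.html", new HashSet<>(Arrays.asList(
            "java.util.HashMap")));
        System.out.println(incomingRefs(pageRefs));
        System.out.println(pagesReferencing("java.util.List", pageRefs));
    }
}
```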

Evaluating Code Extraction and Reference Resolution
on 350 hand-labeled pages from Assieme's data
Code Extraction
- Recall 96.9%, Precision 50.1% (after filtering pages without refs: precision 76.7%)
- False positives: C, C#, JavaScript, PHP, FishEye/diff
Reference Resolution
- Recall 89.6%, Precision 86.5%
- False positives: FishEye and diff pages
- False negatives: incomplete code samples

User Study: Assieme vs. Google vs. Google Code Search
Design
- 40 search tasks based on queries in logs, e.g. query "socket java" became the task "Write a basic server that communicates using Sockets"
- Find code samples (and required libraries)
- 4 blocks of 10 tasks: 1 for training + 1 per interface
Participants
- 9 (under-)graduate students in Computer Science

User Study: Solution Quality
Rating scale: 0 = seriously flawed; 0.5 = generally good but fell short in a critical regard; 1 = fairly complete
[Bar chart: quality (± SEM), 0.0 to 1.0, for Assieme, Google and GCS. Significant differences (*): F(1,258)=55.5, p < .0001 and F(1,258)=6.29, p ≈ .013]

User Study: # Queries Issued
[Bar chart: # queries (± SEM), 0.0 to 2.5, for Assieme, Google and GCS. Significant differences (*): F(1,259)=9.77, p ≈ .002 and F(1,259)=6.85, p ≈ .001]

User Study: Task Time
[Bar chart: seconds (± SEM), 0 to 150, for Assieme, Google and GCS. F(1,258)=5.74, p ≈ .017 (* significant); F(1,258)=1.91, p ≈ .17]

Discussion & Conclusion
- Assieme: a novel web search interface
- Programmers obtain better solutions, using fewer queries, in the same amount of time
- Using Google, subjects visited 3.3 pages/task; using Assieme, only 0.27 pages, but 4.3 previews
- The ability to quickly view code samples changed participants' strategies

Thank You
Raphael Hoffmann, Computer Science & Engineering, University of Washington, raphaelh@cs.washington.edu
James Fogarty, Computer Science & Engineering, University of Washington, jfogarty@cs.washington.edu
Daniel S. Weld, Computer Science & Engineering, University of Washington, weld@cs.washington.edu
This material is based upon work supported by the National Science Foundation under grant IIS-0307906, by the Office of Naval Research under grant N00014-06-1-0147, by SRI International under CALO grant 03-000225, and by the Washington Research Foundation / TJ Cable Professorship.