Ryen W. White, Matthew Richardson, Mikhail Bilenko (Microsoft Research); Allison Heath (Rice University)


Users are generally loyal to one engine, even when the cost of switching engines is low, and even when they are unhappy with search results.
Change can be inconvenient, and users may be unaware of other engines.
A given search engine performs well for some queries and poorly for others.
Excessive loyalty can therefore hinder search effectiveness.

Support engine switching by recommending the most effective search engine for a given query.
Users can keep their default engine but have another search engine suggested if it has better results.

- Switching support vs. meta-search
- Characterizing current search engine switching
- Supporting additional switching
- Evaluating switching support
- Conclusions and implications

Meta-search:
- Merges search results
- Requires a change of default engine (< 1% share)
- Obliterates the benefits of the source engine's UX investments
- Hurts source engine brand awareness
We instead let users keep their default engine and suggest an alternative engine if we estimate it performs better for the current query.

Performed a statistical, log-based study of switching behavior. Aims:
- Characterize switching
- Understand whether switching would benefit users
Extracted millions of search sessions from search logs. Each session began with a query to Google, Yahoo!, or Live Search and ended with 30 minutes of user inactivity.

6.8% of sessions had a switch; 12% of sessions with more than one query had a switch.
Three classes of switching behavior:
- Within-session (33.4% of users)
- Between-session (13.2% of users): switching between sessions, perhaps reflecting engine-task suitability
- Long-term (7.6% of users): defection with no return
Most users are still loyal to a single engine.

Quantify the benefit of multiple engine use; this is important because users must actually benefit from a switch.
Studied search sessions from search logs and evaluated engine performance with:
- Normalized Discounted Cumulative Gain (NDCG)
- Search result click-through rate
Test set of 5K queries issued to Google/Yahoo!/Live with query frequency of at least 5.

Six-level relevance judgments, e.g., for q = [black diamond carabiners]:
  www.bdel.com/gear                                     Perfect
  www.climbing.com/reviews/biners/black_diamond.html    Excellent
  www.climbinggear.com/products/listing/item7588.asp    Good
  www.rei.com/product/471041                            Good
  www.nextag.com/black-diamond/                         Fair
  www.blackdiamondranch.com/                            Bad
We use NDCG at rank 3.
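As an illustration, NDCG@3 over such a judged result list could be computed along the lines of the sketch below. The label-to-gain mapping, the linear gain, the log2 discount, and computing the ideal ordering from the listed results alone are assumptions for this sketch, not the exact definitions used in the paper.

```python
import math

# Illustrative gain values for the six-level judgments (assumed, not from the paper).
GAIN = {"Bad": 0, "Fair": 1, "Good": 2, "Excellent": 3, "Perfect": 4}

def dcg_at_k(labels, k=3):
    """Discounted cumulative gain over the top-k judgment labels."""
    return sum(GAIN[lab] / math.log2(i + 2) for i, lab in enumerate(labels[:k]))

def ndcg_at_k(ranked_labels, k=3):
    """NDCG@k: DCG of the ranking divided by DCG of the ideal reordering.
    The ideal is computed from the listed results only, which is a simplification."""
    ideal = sorted(ranked_labels, key=GAIN.get, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_labels, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Judged results for q = [black diamond carabiners], in the engine's rank order.
print(ndcg_at_k(["Perfect", "Excellent", "Good", "Good", "Fair", "Bad"], k=3))  # 1.0 (already ideal)
```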

Number (%) of the 5K unique queries for which each engine is best:
  Search engine   Relevance (NDCG)   Result click-through rate
  X               952 (19.3%)        2,777 (56.4%)
  Y               1,136 (23.1%)      1,226 (24.9%)
  Z               789 (16.1%)        892 (18.1%)
  No difference   2,044 (41.5%)      26 (0.6%)
We computed the same statistics on all instances of the queries in the logs (not just unique queries). For around 50% of queries there was a different engine with better relevance or CTR, so engine choice for each query is important.

Users may benefit from recommendations that find a better engine for their query.
We model the engine comparison as binary classification, which closely mirrors the switching decision task.
The actual utility of a switch depends on its cost/benefit; using a quality margin can help with this: the quality difference must be at least the margin.
We used a maximum-margin averaged perceptron.

Offline training. Notation:
- q: query; R: result page from the origin engine; R': result page from the target engine
- R* = {(d_1, s_1), ..., (d_k, s_k)}: human-judged result set with k ordered URL-judgment pairs
- Q = {(q, R, R', R*)}: dataset of queries
- x = f(q, R, R'): feature vector for a query and its two result pages
The utility of each engine for each query is represented by its NDCG score: U(R) = NDCG_{R*}(R) and U(R') = NDCG_{R*}(R').
Provide switching support if the target's utility is higher by at least some margin.
The dataset of queries yields a set of training instances D = {(x, y)}, where each instance has y = 1 iff NDCG_{R*}(R') - NDCG_{R*}(R) >= margin.
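A minimal sketch of this offline training setup, assuming the feature vectors x = f(q, R, R') have already been computed. The margin value and the plain averaged perceptron below (standing in for the maximum-margin averaged perceptron named above) are simplifications for illustration, not the authors' exact implementation.

```python
import numpy as np

def label_instance(x, ndcg_origin, ndcg_target, margin=0.1):
    """y = 1 iff the target engine beats the origin engine by at least the margin
    (the margin value here is an arbitrary illustrative choice)."""
    y = 1 if (ndcg_target - ndcg_origin) >= margin else -1
    return np.asarray(x, dtype=float), y

def train_averaged_perceptron(instances, epochs=10):
    """Averaged perceptron: return the average of the weight vector over all updates,
    which is more stable than the final weight vector alone."""
    dim = len(instances[0][0])
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    seen = 0
    for _ in range(epochs):
        for x, y in instances:
            if y * np.dot(w, x) <= 0:  # misclassified (or no separation): perceptron update
                w = w + y * x
            w_sum += w
            seen += 1
    return w_sum / seen

def should_recommend_switch(w, x, threshold=0.0):
    """Fire a switching recommendation only when classifier confidence clears a threshold."""
    return float(np.dot(w, x)) > threshold
```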

The classifier must recommend an engine in real time, so the feature generator needs to be fast.
We derive features from result pages and query-result associations:
- Features from the result pages
- Features from the query
- Features from the query-result page match

Result page features, e.g.:
- Number of results; 10 binary features indicating whether there are 1-10 results
- For each title and snippet: # of characters, # of words, # of HTML tags, # of "..." (indicates skipped text in the snippet), # of "." (indicates a sentence boundary in the snippet)
- # of characters in the URL
- # of characters in the domain (e.g., apple.com)
- # of characters in the URL path (e.g., download/quicktime.html)
- # of characters in the URL parameters (e.g., ?uid=45&p=2)
- 3 binary features: URL starts with http, ftp, or https
- 5 binary features: URL ends with html, aspx, php, htm
- 9 binary features: domain ends with .com, .net, .org, .edu, .gov, .info, .tv, .biz, .uk
- # of "/" in the URL path (i.e., depth of the path)
- # of "&" in the URL path (i.e., number of parameters)
- # of "=" in the URL path (i.e., number of parameters)
- # of matching documents (e.g., "results 1-10 of 2375")

Query features, e.g.:
- # of characters in the query
- # of words in the query
- # of stop words (a, an, the, ...)
- 8 binary features: is the i-th query token a stopword
- 8 features: word lengths (# of characters) ordered from smallest to largest
- 8 features: word lengths ordered from largest to smallest
- Average word length
Match features, e.g.:
- For each text type (title, snippet, URL): # of results where the text contains the exact query
- # of top-1, top-2, top-3 results containing the query
- # of query bigrams in the top-1, top-2, top-3, top-10 results
- # of domains containing the query in the top-1, top-2, top-3 results
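A rough sketch of how a handful of the features from the two lists above might be computed from parsed result pages. The Result structure, the tiny stopword list, and the particular subset of features shown are illustrative assumptions; the real feature generator covers the full lists.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

STOPWORDS = {"a", "an", "the", "of", "and", "to", "in"}  # tiny illustrative stopword list

@dataclass
class Result:
    title: str
    snippet: str
    url: str

def result_page_features(results):
    feats = [len(results)]                                          # number of results
    feats += [1 if len(results) == n else 0 for n in range(1, 11)]  # one reading of the 10 binary count features
    if results:
        top = results[0]                    # a few per-result features, top result only for brevity
        parsed = urlparse(top.url)
        feats += [len(top.title),           # characters in title
                  len(top.title.split()),   # words in title
                  len(top.url),             # characters in URL
                  parsed.path.count("/"),   # depth of the URL path
                  top.url.count("=")]       # rough count of URL parameters
    else:
        feats += [0] * 5
    return feats

def query_features(query):
    words = query.split()
    return [len(query),                                  # characters in query
            len(words),                                  # words in query
            sum(w.lower() in STOPWORDS for w in words)]  # stop words in query

def match_features(query, results):
    q = query.lower()
    return [sum(q in r.title.lower() for r in results),    # titles containing the exact query
            sum(q in r.snippet.lower() for r in results),  # snippets containing the exact query
            sum(q in r.url.lower() for r in results)]      # URLs containing the exact query
```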

[Runtime architecture diagram: the query is sent both to the user's search engine and to a federator that queries the other search engines; the returned result sets go to a feature extractor, and the extracted features are passed to the classifier (trained offline), which produces the recommendation.]
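The runtime flow in that diagram could look roughly like the sketch below; fetch_results, featurize, and the weight vector w are placeholders standing in for the federator, the feature extractor, and the offline-trained classifier.

```python
import numpy as np

def recommend_engine(query, fetch_results, featurize, w,
                     origin="default", target="alternative", threshold=0.5):
    """End-to-end sketch: federate the query, extract features from both result pages,
    and suggest the target engine only if classifier confidence clears the threshold."""
    origin_page = fetch_results(origin, query)   # results from the user's default engine
    target_page = fetch_results(target, query)   # results from the candidate engine
    x = np.asarray(featurize(query, origin_page, target_page), dtype=float)
    score = float(np.dot(w, x))                  # linear classifier trained offline
    return target if score > threshold else None
```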

Evaluate the accuracy of switching support to determine its viability.
Task: accurately predict when one search engine is better than another.
Ground truth: a labeled corpus of queries randomly sampled from search engine logs; human judges evaluated several dozen top-ranked results returned by Google, Yahoo!, and Live Search.

Total number of queries: 17,111
Total number of judged pages: 4,254,730
Total number of judged pages labeled Fair or higher: 1,378,011
10-fold cross-validation, 100 runs, randomized fold assignment.
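A minimal sketch of that evaluation protocol (repeated 10-fold cross-validation with randomized fold assignment); train_fn and eval_fn are placeholders for the classifier training and scoring steps.

```python
import random

def repeated_cross_validation(data, train_fn, eval_fn, folds=10, runs=100, seed=0):
    """Average a score over `runs` repetitions of k-fold cross-validation,
    reshuffling the fold assignment on every run."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        for k in range(folds):
            test = shuffled[k::folds]                                    # held-out fold k
            train = [d for i, d in enumerate(shuffled) if i % folds != k]
            model = train_fn(train)
            scores.append(eval_fn(model, test))
    return sum(scores) / len(scores)
```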

Trade-offs (recall, interruption, error cost): a low confidence threshold gives more frequent but more erroneous recommendations, and it is preferable to interrupt the user less often, with higher accuracy.
We therefore use precision-recall curves rather than a single accuracy point:
- Precision = # of true positives / total # of predicted positives
- Recall = # of true positives / total # of actual positives
Varying the confidence threshold traces out the P-R curve.
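Sweeping the confidence threshold to trace out the precision-recall curve could be sketched as below, assuming each test query has a classifier score and a binary ground-truth label saying whether switching is actually beneficial.

```python
def precision_recall_curve(scores, labels):
    """Return (recall, precision, threshold) points obtained by sweeping the decision
    threshold over the observed classifier scores.
    labels[i] is 1 if switching is beneficial for query i, else 0."""
    total_positives = sum(labels)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        predictions = [s >= threshold for s in scores]
        tp = sum(p and y for p, y in zip(predictions, labels))
        fp = sum(p and not y for p, y in zip(predictions, labels))
        precision = tp / (tp + fp) if (tp + fp) else 1.0
        recall = tp / total_positives if total_positives else 0.0
        points.append((recall, precision, threshold))
    return points
```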

[Figure: precision-recall curves for each ordered engine pair (X to Z, Z to X, X to Y, Y to X, Y to Z, Z to Y).]
Precision is low (~50%) at high recall levels: with a low threshold, queries on which the engines are equally accurate are treated as switch-worthy. This demonstrates the difficulty of the task.

Goal: provide additional value over the current search engine, with accurate switching suggestions and infrequent user interruption (a suggestion is not needed for every query).
Summary of precision at recall = 0.05 (from Table 4):
  From \ To    X        Y        Z
  X            -        0.758    0.883
  Y            0.811    -        0.816
  Z            0.860    0.795    -
At this operating point the classifier would fire accurately for 1 query in 20.

Querying the additional engine may add network traffic, which is undesirable to the target engine.
[Figure: precision-recall curves per engine (X, Y, Z) for this setting.]
Accuracy is lower, but latency may be less.

[Figure: precision at recall = 0.05 and at recall = 0.5 as a function of the margin, for feature-set combinations (All, Results+Match, Query+Match, Results+Query) and for individual feature sets (All, Query, Results, Match).]
All sets of features contribute to accuracy; features obtained from the result pages seem to provide the most benefit.

- Demonstrated the potential benefit of switching
- Described a method for automatically determining when to switch engines for a given query
- Evaluated the method and illustrated good performance, especially at usable recall levels
Switching support is an important new research area that has the potential to really help users.

User studies:
- Task: switching based on the search task rather than just search queries
- Interruption: understanding the user's focus of attention and willingness to be interrupted
- Cognitive burden of adapting to a new engine