Large-Scale Validation and Analysis of Interleaved Search Evaluation

Large-Scale Validation and Analysis of Interleaved Search Evaluation
Olivier Chapelle, Thorsten Joachims, Filip Radlinski, Yisong Yue
Department of Computer Science, Cornell University

Decide Between Two Ranking Functions

Given a distribution P(u,q) of users u and queries q (for example, user tj issuing the query "svm"), retrieval function 1 produces the ranking r1 = f1(u,q) and retrieval function 2 produces the ranking r2 = f2(u,q). Which one is better?

r1 = f1(u,q):
1. Kernel Machines http://svm.first.gmd.de/
2. SVM-Light Support Vector Machine http://svmlight.joachims.org/
3. School of Veterinary Medicine at UPenn http://www.vet.upenn.edu/
4. An Introduction to Support Vector Machines http://www.support-vector.net/
5. Service Master Company http://www.servicemaster.com/

r2 = f2(u,q):
1. School of Veterinary Medicine at UPenn http://www.vet.upenn.edu/
2. Service Master Company http://www.servicemaster.com/
3. Support Vector Machine http://jbolivar.freeservers.com/
4. Archives of SUPPORT-VECTOR-MACHINES http://www.jiscmail.ac.uk/lists/support...
5. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm light/

The question is which ranking has higher utility for the user: U(tj, "svm", r1) vs. U(tj, "svm", r2).

Implicit Utility Feedback

Approach 1: Absolute Metrics. Do metrics derived from observed user behavior provide absolute feedback about the retrieval quality of f? For example: U(f) ~ numclicks(f), or U(f) ~ 1/abandonment(f).

Approach 2: Paired Comparison Tests. Do paired comparison tests provide relative preferences between two retrieval functions f1 and f2? For example: f1 ≻ f2 ⟺ pairedcomptest(f1, f2) > 0.

Absolute Metrics

Name | Description | Aggregation | Hypothesized change with decreased quality
Abandonment Rate | % of queries with no click | N/A | Increase
Reformulation Rate | % of queries that are followed by a reformulation | N/A | Increase
Queries per Session | Session = no interruption of more than 30 minutes | Mean | Increase
Clicks per Query | Number of clicks | Mean | Decrease
Click@1 | % of queries with a click at position 1 | N/A | Decrease
Max Reciprocal Rank* | 1/rank of the highest click | Mean | Decrease
Mean Reciprocal Rank* | Mean of 1/rank over all clicks | Mean | Decrease
Time to First Click* | Seconds before the first click | Median | Increase
Time to Last Click* | Seconds before the final click | Median | Decrease

(*) Only queries with at least one click count.
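To make these definitions concrete, here is a minimal sketch in Python that computes several of the per-query metrics from a click log. The log format (a list of records holding the click ranks and click times for each query) is hypothetical and chosen only for illustration; session-level metrics (Reformulation Rate, Queries per Session) are omitted because they require session boundaries.

```python
from statistics import mean, median

# Hypothetical per-query records: ranks of clicked results (1-based) and
# click times in seconds after the result page was shown.
queries = [
    {"click_ranks": [1, 3], "click_times": [5.2, 40.1]},
    {"click_ranks": [],     "click_times": []},          # abandoned query
    {"click_ranks": [2],    "click_times": [12.7]},
]

def absolute_metrics(queries):
    clicked = [q for q in queries if q["click_ranks"]]    # (*) metrics use only these
    return {
        "Abandonment Rate": sum(1 for q in queries if not q["click_ranks"]) / len(queries),
        "Clicks per Query": mean(len(q["click_ranks"]) for q in queries),
        "Click@1": sum(1 for q in queries if 1 in q["click_ranks"]) / len(queries),
        "Max Reciprocal Rank": mean(1.0 / min(q["click_ranks"]) for q in clicked),
        "Mean Reciprocal Rank": mean(mean(1.0 / r for r in q["click_ranks"]) for q in clicked),
        "Time to First Click": median(min(q["click_times"]) for q in clicked),
        "Time to Last Click": median(max(q["click_times"]) for q in clicked),
    }

print(absolute_metrics(queries))
```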

How Does User Behavior Reflect Retrieval Quality? A User Study on ArXiv.org

Natural user and query population; users in their natural context, not in a lab; a live and operational search engine.

Ground truth by construction:
ORIG ≻ SWAP2 ≻ SWAP4, where ORIG is the hand-tuned fielded ranking, SWAP2 is ORIG with 2 pairs swapped, and SWAP4 is ORIG with 4 pairs swapped.
ORIG ≻ FLAT ≻ RAND, where FLAT uses no field weights and RAND is the top 10 of FLAT, shuffled.

Absolute Metrics: Experiment Setup

Phase I (36 days): users randomly receive rankings from ORIG, FLAT, or RAND.
Phase II (30 days): users randomly receive rankings from ORIG, SWAP2, or SWAP4.
Users are permanently assigned to one experimental condition based on IP address and browser.

Basic statistics: ~700 queries per day, ~300 distinct users per day.

Quality control and data cleaning: a test run of 32 days; heuristics to identify bots and spammers; all evaluation code was written twice and cross-validated.

Absolute Metrics: Results

[Bar chart of the absolute metrics for ORIG, FLAT, RAND and for ORIG, SWAP2, SWAP4; figure omitted.]

Summary and conclusions: none of the absolute metrics reflects the expected order; most differences are not significant after one month of data; absolute metrics are not suitable for ArXiv-sized search engines.

Yahoo! Search: Results

Retrieval functions: 4 variants of the production retrieval function.
Data: 10M to 70M queries for each retrieval function, plus expert relevance judgments.

Results: still not always significant, even after more than 10M queries per function; only Click@1 is consistent with DCG@5.

[Chart comparing the retrieval function pairs (D>C, D>B, C>B, D>A, C>A, B>A) on Time to Last Click, Time to First Click, Mean Reciprocal Rank, Max Reciprocal Rank, pSkip, Clicks@1, Clicks per Query, Abandonment Rate, and DCG@5; figure omitted.]

[Chapelle et al., 2012]

Approaches to Utility Elicitation

Approach 1: Absolute Metrics. Do metrics derived from observed user behavior provide absolute feedback about the retrieval quality of f? For example: U(f) ~ numclicks(f), or U(f) ~ 1/abandonment(f).

Approach 2: Paired Comparison Tests. Do paired comparison tests provide relative preferences between two retrieval functions f1 and f2? For example: f1 ≻ f2 ⟺ pairedcomptest(f1, f2) > 0.

Paired Comparisons: What to Measure? (u = tj, q = "svm")

r1 = f1(u,q):
1. Kernel Machines http://svm.first.gmd.de/
2. Support Vector Machine http://jbolivar.freeservers.com/
3. An Introduction to Support Vector Machines http://www.support-vector.net/
4. Archives of SUPPORT-VECTOR-MACHINES... http://www.jiscmail.ac.uk/lists/support...
5. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm light/

r2 = f2(u,q):
1. Kernel Machines http://svm.first.gmd.de/
2. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm light/
3. Support Vector Machine and Kernel... References http://svm.research.bell-labs.com/svmrefs.html
4. Lucent Technologies: SVM demo applet http://svm.research.bell-labs.com/svt/svmsvt.html
5. Royal Holloway Support Vector Machine http://svm.dcs.rhbnc.ac.uk

Interpretation: (r1 ≻ r2) ⟺ clicks(r1) > clicks(r2)

Paired Comparison: Balanced Interleaving (u = tj, q = "svm")

Model of the user: the better retrieval function is more likely to get more clicks.

r1 = f1(u,q):
1. Kernel Machines http://svm.first.gmd.de/
2. Support Vector Machine http://jbolivar.freeservers.com/
3. An Introduction to Support Vector Machines http://www.support-vector.net/
4. Archives of SUPPORT-VECTOR-MACHINES... http://www.jiscmail.ac.uk/lists/support...
5. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm light/

r2 = f2(u,q):
1. Kernel Machines http://svm.first.gmd.de/
2. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm light/
3. Support Vector Machine and Kernel... References http://svm.research.bell-labs.com/svmrefs.html
4. Lucent Technologies: SVM demo applet http://svm.research.bell-labs.com/svt/svmsvt.html
5. Royal Holloway Support Vector Machine http://svm.dcs.rhbnc.ac.uk

Interleaving(r1, r2), with each result annotated by the smaller of its ranks in r1 and r2:
1. Kernel Machines (1) http://svm.first.gmd.de/
2. Support Vector Machine (2) http://jbolivar.freeservers.com/
3. SVM-Light Support Vector Machine (2) http://ais.gmd.de/~thorsten/svm light/
4. An Introduction to Support Vector Machines (3) http://www.support-vector.net/
5. Support Vector Machine and Kernel... References (3) http://svm.research.bell-labs.com/svmrefs.html
6. Archives of SUPPORT-VECTOR-MACHINES... (4) http://www.jiscmail.ac.uk/lists/support...
7. Lucent Technologies: SVM demo applet (4) http://svm.research.bell-labs.com/svt/svmsvt.html

Invariant: for all k, the top k of the balanced interleaving is the union of the top k1 of r1 and the top k2 of r2, with k1 = k2 ± 1.

Interpretation: (r1 ≻ r2) ⟺ clicks(top-k of r1) > clicks(top-k of r2)

See also [Joachims, 2001] [Radlinski et al., 2008] [Radlinski, Craswell, 2012] [Hofmann, 2012]
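The construction and the click attribution rule can be summarized in a short sketch. This is a simplified reconstruction from the slide and from Joachims (2001), assuming rankings are lists of document identifiers; the function names and exact tie-breaking are illustrative, not the authors' implementation.

```python
import random

def balanced_interleave(r1, r2, length=10, rng=random):
    """Build a balanced interleaved ranking from two rankings (lists of doc ids)."""
    r1_first = rng.random() < 0.5        # one coin flip decides which ranking leads
    interleaved, k1, k2 = [], 0, 0
    while k1 < len(r1) and k2 < len(r2) and len(interleaved) < length:
        # Advance whichever ranking is "behind"; break ties with the coin flip.
        if k1 < k2 or (k1 == k2 and r1_first):
            if r1[k1] not in interleaved:
                interleaved.append(r1[k1])
            k1 += 1
        else:
            if r2[k2] not in interleaved:
                interleaved.append(r2[k2])
            k2 += 1
    return interleaved

def balanced_credit(r1, r2, interleaved, clicked):
    """Attribute clicks: compare clicks in the top k of r1 vs. r2, where k is the
    smallest depth whose union of top-k(r1) and top-k(r2) covers everything down
    to the lowest click in the interleaved list. `clicked` must be non-empty."""
    lowest = max(interleaved.index(d) for d in clicked) + 1
    seen = set(interleaved[:lowest])
    k = next(j for j in range(1, max(len(r1), len(r2)) + 1)
             if seen <= set(r1[:j]) | set(r2[:j]))
    c1 = len(set(clicked) & set(r1[:k]))
    c2 = len(set(clicked) & set(r2[:k]))
    return "r1" if c1 > c2 else "r2" if c2 > c1 else "tie"
```

A single random bit decides which ranking leads; credit is then assigned by comparing, at the depth k just sufficient to cover everything above the lowest click, how many clicked results each original ranking contains.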

Balanced Interleaving: A Problem

Example: two rankings r1 and r2 that are identical up to one insertion (X):
r1 = (A, B, C, D)
r2 = (X, A, B, C, D)
Balanced interleaving yields (X, A, B, C, D).

Suppose a random user clicks uniformly on a single result in the interleaved ranking:
1. X -> r2 wins
2. A -> r1 wins
3. B -> r1 wins
4. C -> r1 wins
5. D -> r1 wins
So r1 wins 4 times out of 5 even though the two rankings are nearly identical: the comparison is biased.
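A self-contained check of this example (the document ids and rankings mirror the slide's schematic rather than real results; the credit function re-implements the balanced-interleaving attribution rule from the previous sketch):

```python
# Slide's example: r2 inserts X at the top of r1.
r1 = ["A", "B", "C", "D"]
r2 = ["X", "A", "B", "C", "D"]
interleaved = ["X", "A", "B", "C", "D"]   # one of the two balanced interleavings

def credit(clicked_doc):
    lowest = interleaved.index(clicked_doc) + 1
    seen = set(interleaved[:lowest])
    # smallest k whose union of top-k(r1) and top-k(r2) covers the clicked prefix
    k = next(j for j in range(1, 6) if seen <= set(r1[:j]) | set(r2[:j]))
    c1 = len({clicked_doc} & set(r1[:k]))
    c2 = len({clicked_doc} & set(r2[:k]))
    return "r1" if c1 > c2 else "r2" if c2 > c1 else "tie"

for doc in interleaved:
    print(doc, "->", credit(doc))
```

Running it prints r2 for X and r1 for A, B, C, and D, reproducing the 4-out-of-5 bias toward r1.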

Paired Comparisons: Team-Game Interleaving (u = tj, q = "svm")

r1 = f1(u,q):
1. Kernel Machines http://svm.first.gmd.de/
2. Support Vector Machine http://jbolivar.freeservers.com/
3. An Introduction to Support Vector Machines http://www.support-vector.net/
4. Archives of SUPPORT-VECTOR-MACHINES... http://www.jiscmail.ac.uk/lists/support...
5. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm light/

r2 = f2(u,q):
1. Kernel Machines http://svm.first.gmd.de/
2. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm light/
3. Support Vector Machine and Kernel... References http://svm.research.bell-labs.com/svmrefs.html
4. Lucent Technologies: SVM demo applet http://svm.research.bell-labs.com/svt/svmsvt.html
5. Royal Holloway Support Vector Machine http://svm.dcs.rhbnc.ac.uk

Interleaving(r1, r2), with each result assigned to team T1 (picked for r1) or team T2 (picked for r2):
1. Kernel Machines (T2) http://svm.first.gmd.de/
2. Support Vector Machine (T1) http://jbolivar.freeservers.com/
3. SVM-Light Support Vector Machine (T2) http://ais.gmd.de/~thorsten/svm light/
4. An Introduction to Support Vector Machines (T1) http://www.support-vector.net/
5. Support Vector Machine and Kernel... References (T2) http://svm.research.bell-labs.com/svmrefs.html
6. Archives of SUPPORT-VECTOR-MACHINES... (T1) http://www.jiscmail.ac.uk/lists/support...
7. Lucent Technologies: SVM demo applet (T2) http://svm.research.bell-labs.com/svt/svmsvt.html

Invariant: for all k, in expectation the top k contains the same number of team members from each team.

Interpretation: (r1 ≻ r2) ⟺ clicks(T1) > clicks(T2)
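Below is a minimal sketch of this construction, commonly known as team-draft interleaving [Radlinski et al., 2008]. It is illustrative only; the names are invented and it is not the code used in the experiments.

```python
import random

def team_game_interleave(r1, r2, length=10, rng=random):
    """Team-game (team-draft) interleaving: the team that is behind picks next,
    with ties broken by a coin flip; each pick is the team's highest-ranked
    result not yet shown."""
    shown, team1, team2 = [], set(), set()
    all_docs = set(r1) | set(r2)
    while len(shown) < min(length, len(all_docs)):
        pick_r1 = len(team1) < len(team2) or (len(team1) == len(team2) and rng.random() < 0.5)
        primary, secondary = ((r1, team1), (r2, team2)) if pick_r1 else ((r2, team2), (r1, team1))
        for ranking, team in (primary, secondary):
            doc = next((d for d in ranking if d not in shown), None)
            if doc is not None:
                shown.append(doc)
                team.add(doc)
                break
        else:
            break                         # both rankings exhausted
    return shown, team1, team2

def team_game_credit(team1, team2, clicked):
    """r1 wins if its team collected more clicks, r2 if fewer, tie otherwise."""
    c1 = len(set(clicked) & team1)
    c2 = len(set(clicked) & team2)
    return "r1" if c1 > c2 else "r2" if c2 > c1 else "tie"
```

In expectation the coin flips keep both teams equally represented in every prefix, which is the invariant stated above; the comparison then simply counts which team's results attracted more clicks.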

Paired Comparisons: Experiment Setup

Phase I (36 days): Balanced Interleaving of (Orig, Flat), (Flat, Rand), and (Orig, Rand).
Phase II (30 days): Balanced Interleaving of (Orig, Swap2), (Swap2, Swap4), and (Orig, Swap4).

Quality control and data cleaning: same as for the absolute metrics.

Balanced Interleaving: Results

[Bar chart of percent wins per comparison, e.g. % wins ORIG vs. % wins RAND; figure omitted.]

Summary and conclusions: all interleaving experiments reflect the expected order; all differences are significant after one month of data; the same results hold for alternative data preprocessing.

Team-Game Interleaving: Results

[Bar chart of percent wins per comparison; figure omitted.]

Summary and conclusions: all interleaving experiments reflect the expected order; results are similar to Balanced Interleaving; most differences are significant after one month of data.

Yahoo! and Bing: Interleaving Results

Yahoo! Web Search [Chapelle et al., 2012]: four retrieval functions (i.e., 6 paired comparisons), evaluated with Balanced Interleaving. All paired comparisons are consistent with the ordering by NDCG.

Bing Web Search [Radlinski & Craswell, 2010]: five retrieval function pairs, evaluated with Team-Game Interleaving. Consistent with the ordering by NDCG whenever the NDCG difference is significant.

Efficiency: Interleaving vs. Absolute Metrics

Yahoo! Web Search: more than 10M queries for the absolute measures, approximately 700k queries for interleaving.

Experiment: repeatedly draw a bootstrap sample S of size x, evaluate the metric m on S for a pair (P, Q) of retrieval functions, and estimate y = P(P ≻_m Q | x), the probability that the metric prefers P over Q given x queries.

Interleaving is roughly a factor of 10 more efficient than Click@1. [Chapelle, Joachims, Radlinski, Yue, to appear]
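A minimal sketch of that bootstrap procedure, assuming each logged query has already been reduced to a per-query preference signal (+1 if the metric favors P on that query, -1 if it favors Q, 0 for a tie); the data layout and numbers are hypothetical:

```python
import random

def prob_correct_preference(per_query_signal, sample_size, n_bootstrap=1000, rng=random):
    """Estimate y = P(metric prefers P over Q | sample of `sample_size` queries)
    by drawing bootstrap samples from the logged per-query signals."""
    wins = 0
    for _ in range(n_bootstrap):
        sample = rng.choices(per_query_signal, k=sample_size)   # sample with replacement
        if sum(sample) > 0:               # the aggregated metric prefers P on this sample
            wins += 1
    return wins / n_bootstrap

# Example: P truly wins 52% of non-tied queries; how often does a 1,000-query
# sample detect this?
signals = [1] * 520 + [-1] * 480
print(prob_correct_preference(signals, sample_size=1000))
```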

Efficiency: Interleaving vs. Explicit Judgments

Bing Web Search: 4 retrieval function pairs, ~12k manually judged queries, ~200k interleaved queries.

Experiment: let p be the probability that NDCG gives the correct preference on a judged subsample of size y, and let x be the number of interleaved queries needed to reach the same p-value.

Result: roughly ten interleaved queries are equivalent to one manually judged query. [Radlinski & Craswell, 2010]

Summary and Conclusions

Interleaving agrees better with expert assessment than absolute metrics do.
It is designed as a pairwise comparison.
All interleaving techniques seem to do roughly equally well.
Interleaving is efficient compared to expert assessment and to Click@1.