A Comparative Analysis of Cascade Measures for Novelty and Diversity


A Comparative Analysis of Cascade Measures for Novelty and Diversity. Charles Clarke, University of Waterloo; Nick Craswell, Microsoft; Ian Soboroff, NIST; Azin Ashkan, University of Waterloo.

Background. Measuring Web search effectiveness: editorial judgments (e.g., ndcg), implicit feedback (e.g., click curves), focused user studies. Limitations: we consider editorial measures only, and we assume ranked lists.

Novelty and Diversity. Diversity reflects the variety of user needs underlying a query; novelty reflects the variety of information appearing in a search engine result page. Traditional measures ignore redundancy and inter-document relevance. Our goal is to validate tractable measures that address these issues.

TREC Web Track (2009-2011) Provides: A framework for exploring novelty and diversity in Web search (the billion-page ClueWeb09 document collection, 50 topics/queries). Goal: Return a ranked list of documents in order of decreasing probability of relevance, where relevance is considered in the context of higher-ranked documents. Evaluation: Assessed through binary judgments made with respect to explicit subtopics.

<topic number="55" type="ambiguous"> <query>iron</query> participants given only the query <description> Find information about iron as an essential nutrient. </description> <subtopic number="1" type="inf"> Find information about iron as an essential nutrient. </subtopic> <subtopic number="2" type="inf"> What foods contain iron? </subtopic> <subtopic number="3" type="nav"> Find sites where I can buy iron supplements. </subtopic> results judged <subtopic number="4" type="inf"> independently Find information about the element iron (Fe). </subtopic> with respect to <subtopic number="5" type="inf"> each subtopic Find information about iron deficiencies. </subtopic> <subtopic number="6" type="nav"> Find dealers in irons for clothing. </subtopic> </topic>

<topic number="73" type="faceted"> <query>neil young</query> <description> Find music, tour dates, and information about the musician Neil Young. </description> <subtopic number="1" type="nav"> Find albums by Neil Young to buy. </subtopic> <subtopic number="2" type="inf"> Find biographical information about Neil Young. </subtopic> <subtopic number="3" type="nav"> Find lyrics or sheet music for Neil Young's songs. </subtopic> <subtopic number="4" type="nav"> Find a list of Neil Young tour dates. </subtopic> </topic>


Measuring Novelty. Subtopic recall: a simple measure often used in early work, e.g., Zhai et al. (2003), Zhang et al. (2005). Intent-aware versions of traditional measures: Agrawal et al. (2009): ndcg-ia, MAP-IA. Intent-aware cascade measures: Clarke et al. (2008): α-ndcg; Chapelle et al. (2009): ERR; Clarke et al. (2009): NRBP. Plus: Sakai et al. (2010), Rafiei et al. (2010), etc.
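For concreteness, subtopic recall at depth k is the fraction of a topic's subtopics covered by at least one relevant document among the top k results (the notation below is ours, not transcribed from the slides):

\[
\text{S-recall}@k \;=\; \frac{\bigl|\bigcup_{j=1}^{k} \operatorname{subtopics}(d_j)\bigr|}{M},
\]

where d_1, ..., d_k are the top-ranked documents and M is the number of subtopics for the topic.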

Subtopic Recall vs. Precision (TREC 2009) Top-performing groups at TREC 2009 exhibit a mixture of outcomes, with some groups excelling at novelty and others excelling at traditional relevance.

Intent-Aware Measures. The score for a topic is a weighted average over its subtopics (the probabilities need not sum to 1); the base measure (e.g., ndcg, MAP) is applied independently to each of the M subtopics.
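In symbols (notation ours), the intent-aware version of a base measure E scores a topic q as a probability-weighted average of E computed per subtopic:

\[
E\text{-IA}(q) \;=\; \sum_{i=1}^{M} p(i \mid q)\, E(q, i),
\]

where E(q, i) is the base measure evaluated using only the judgments for subtopic i, and p(i | q) is the weight of subtopic i.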

Intent-Aware Measures. Intent-aware versions of traditional measures (MAP, ndcg, etc.) do not penalize redundancy. To achieve the best performance under these traditional measures, a search engine should focus on the most popular subtopic. In contrast, cascade measures explicitly penalize redundancy.

Intent-Aware Cascade Measures. The general form combines a weighted average over subtopics, a per-rank gain, a rank discount, and a normalization term.
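One way to write this shared structure (notation ours; the slide presents it as a labelled formula):

\[
\text{score}(q) \;=\; \frac{1}{\mathcal{N}} \sum_{i=1}^{M} p(i \mid q) \sum_{k=1}^{K} g_i(k)\, d(k),
\]

where p(i | q) gives the weighted average over subtopics, g_i(k) is the gain document d_k contributes for subtopic i, d(k) is the rank discount, and the leading factor normalizes the score.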

Gain Penalizes Redundancy. α represents the user's tolerance for redundancy.
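A common instantiation of the gain is the α-ndcg form (notation ours):

\[
g_i(k) \;=\; J(d_k, i)\,(1-\alpha)^{C_i(k-1)}, \qquad C_i(k-1) \;=\; \sum_{j=1}^{k-1} J(d_j, i),
\]

where J(d, i) is 1 if document d is judged relevant to subtopic i and 0 otherwise. Each additional document covering an already-covered subtopic contributes a factor of (1 − α) less, so larger α penalizes redundancy more heavily.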

Discount Penalizes Depth (k): α-ndcg (Clarke et al., 2008), ERR (Chapelle et al., 2009), NRBP (Clarke et al., 2009).
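The rank discounts typically associated with these measures (standard forms, not transcribed from the slides) are:

\[
\text{$\alpha$-ndcg: } d(k) = \frac{1}{\log_2(k+1)}, \qquad \text{ERR: } d(k) = \frac{1}{k}, \qquad \text{NRBP: } d(k) = \beta^{\,k-1},
\]

where β is a persistence parameter, as in rank-biased precision.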

Normalization Controls for the Number of Subtopics. Scores should be normalized into the range [0,1] for averaging across topics, since the highest possible score can grow with the number of subtopics. Normalization can be either collection independent or collection dependent; computing the ideal ranking for collection-dependent normalization is NP-hard. See Carterette (2009) for discussion.
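For concreteness, here is a minimal sketch (our own code, not the authors'; document ids, the qrels layout, and parameter defaults are illustrative assumptions) of computing α-ndcg with the usual greedy approximation to the ideal ranking used for collection-dependent normalization:

import math

def alpha_gains(ranking, qrels, alpha=0.5):
    # Per-rank gain: a subtopic's contribution decays by (1 - alpha) each
    # time that subtopic has already been covered higher in the list.
    seen = {}                              # subtopic id -> docs already covering it
    gains = []
    for doc in ranking:
        g = 0.0
        for s in qrels.get(doc, ()):       # subtopics this document is relevant to
            g += (1 - alpha) ** seen.get(s, 0)
            seen[s] = seen.get(s, 0) + 1
        gains.append(g)
    return gains

def dcg(gains):
    # alpha-ndcg's logarithmic discount; ERR would use 1/k, NRBP beta**(k-1).
    return sum(g / math.log2(k + 2) for k, g in enumerate(gains))

def greedy_ideal(qrels, depth, alpha=0.5):
    # Greedy approximation to the ideal ordering; the exact problem is NP-hard.
    remaining, ideal, seen = set(qrels), [], {}
    while remaining and len(ideal) < depth:
        best = max(remaining,
                   key=lambda d: sum((1 - alpha) ** seen.get(s, 0) for s in qrels[d]))
        ideal.append(best)
        remaining.remove(best)
        for s in qrels[best]:
            seen[s] = seen.get(s, 0) + 1
    return ideal

def alpha_ndcg(ranking, qrels, depth=20, alpha=0.5):
    ideal_dcg = dcg(alpha_gains(greedy_ideal(qrels, depth, alpha), qrels, alpha))
    return dcg(alpha_gains(ranking[:depth], qrels, alpha)) / ideal_dcg if ideal_dcg else 0.0

# Hypothetical qrels for one topic: doc id -> set of judged-relevant subtopic ids.
qrels = {"d1": {1, 2}, "d2": {1}, "d3": {3}}
print(alpha_ndcg(["d2", "d1", "d3"], qrels))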

Impact of Varying α on α-ndcg

Impact of Varying α on nerr

Impact of Varying α on nnrbp

MAP-IA vs. α-ndcg

nerr vs. α-ndcg

nnrbp vs. α-ndcg

Impact of Normalization on α-(n)dcg

Discriminative Power

Conclusions. A unified framework for intent-aware measures, especially cascade measures. Intent-aware measures must penalize redundancy. The measures provide a trade-off between traditional precision and subtopic recall. Collection-dependent normalization may not be required, but discriminative power is greatest with MAP. Questions?