Information Retrieval

Similar documents
Information Retrieval

Retrieval Evaluation. Hongning Wang

Evaluating search engines CE-324: Modern Information Retrieval Sharif University of Technology

CSCI 599: Applications of Natural Language Processing Information Retrieval Evaluation

Chapter 8. Evaluating Search Engine

Search Evaluation. Tao Yang CS293S Slides partially based on text book [CMS] [MRS]

Search Engines Chapter 8 Evaluating Search Engines Felix Naumann

THIS LECTURE. How do we know if our results are any good? Results summaries: Evaluating a search engine. Making our good results usable to a user

Information Retrieval. Lecture 7

CSCI 5417 Information Retrieval Systems. Jim Martin!

Information Retrieval and Web Search

Information Retrieval

Search Engines: Information Retrieval in Practice. Annotations by Michael L. Nelson

Part 7: Evaluation of IR Systems Francesco Ricci

CS6322: Information Retrieval Sanda Harabagiu. Lecture 13: Evaluation

Advanced Search Techniques for Large Scale Data Analytics Pavel Zezula and Jan Sedmidubsky Masaryk University

Information Retrieval

Diversification of Query Interpretations and Search Results

Evaluating search engines CE-324: Modern Information Retrieval Sharif University of Technology

Evaluation of Retrieval Systems

A Comparative Analysis of Cascade Measures for Novelty and Diversity

Evaluating search engines CE-324: Modern Information Retrieval Sharif University of Technology

Ryen W. White, Matthew Richardson, Mikhail Bilenko Microsoft Research Allison Heath Rice University

Information Retrieval. Lecture 7 - Evaluation in Information Retrieval. Introduction. Overview. Standard test collection. Wintersemester 2007

Information Retrieval

Evaluation. David Kauchak cs160 Fall 2009 adapted from:

ICFHR 2014 COMPETITION ON HANDWRITTEN KEYWORD SPOTTING (H-KWS 2014)

Retrieval Evaluation

Discounted Cumulated Gain based Evaluation of Multiple Query IR Sessions

Advances on the Development of Evaluation Measures. Ben Carterette Evangelos Kanoulas Emine Yilmaz

Performance Measures for Multi-Graded Relevance

Experiment Design and Evaluation for Information Retrieval Rishiraj Saha Roy Computer Scientist, Adobe Research Labs India

Ranked Retrieval. Evaluation in IR. One option is to average the precision scores at discrete points on the ROC curve. But which points?

Information Retrieval

Evaluation of Retrieval Systems

Evaluation of Retrieval Systems

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

User Logs for Query Disambiguation

5. Novelty & Diversity

WebSci and Learning to Rank for IR

Information Retrieval

Beyond Independent Relevance: Methods and Evaluation Metrics for Subtopic Retrieval

Evaluation of Retrieval Systems

Personalized Web Search

CS47300: Web Information Search and Management

CMPSCI 646, Information Retrieval (Fall 2003)

The TREC 2005 Terabyte Track

Modern Retrieval Evaluations. Hongning Wang

Document indexing, similarities and retrieval in large scale text collections

Reducing Redundancy with Anchor Text and Spam Priors

Ranking Assessment of Event Tweets for Credibility

Chapter III.2: Basic ranking & evaluation measures

Lizhe Sun. November 17, Florida State University. Ranking in Statistics and Machine Learning. Lizhe Sun. Introduction

Information Retrieval

This lecture. Measures for a search engine EVALUATING SEARCH ENGINES. Measuring user happiness. Measures for a search engine

PERSONALIZED TAG RECOMMENDATION

Recommender Systems: Practical Aspects, Case Studies. Radek Pelánek

Information Retrieval. (M&S Ch 15)

Benchmarks, Performance Evaluation and Contests for 3D Shape Retrieval

A New Measure of the Cluster Hypothesis

Modeling Expected Utility of Multi-session Information Distillation

Learning Ranking Functions with Implicit Feedback

Multimedia Information Systems

Assignment 1. Assignment 2. Relevance. Performance Evaluation. Retrieval System Evaluation. Evaluate an IR system

CISC689/ Information Retrieval Midterm Exam

Lecture 5: Information Retrieval using the Vector Space Model

Apache Solr Learning to Rank FTW!

The Use of Inverted Index to Information Retrieval: ADD Intelligent in Aviation Case Study

Query Processing and Alternative Search Structures. Indexing common words

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Infinite data. Filtering data streams

CS224W Final Report Emergence of Global Status Hierarchy in Social Networks

An Investigation of Basic Retrieval Models for the Dynamic Domain Task

Active Evaluation of Ranking Functions based on Graded Relevance (Extended Abstract)

ΕΠΛ660. Retrieval with the vector space model

A Task Level Metric for Measuring Web Search Satisfaction and its Application on Improving Relevance Estimation

Overview of the NTCIR-13 OpenLiveQ Task

BLENDER: Enabling Local Search with a Hybrid Differential Privacy Model

A Simple and Efficient Sampling Method for Estimating AP and NDCG

A Class of Submodular Functions for Document Summarization

Ranking and Learning. Table of Contents: Weighted scoring for ranking; Learning to rank: A simple example; Learning to rank as classification.

Session 10: Information Retrieval

Information Retrieval. Techniques for Relevance Feedback

An Exploration of Query Term Deletion

MAY 29 and MAY 30. What's New? When is it Happening? Here are the new features most relevant to CRC users:

Efficient Diversification of Web Search Results

Tabularized Search Results

Predictive Analysis: Evaluation and Experimentation. Heejun Kim

Sec. 8.7 RESULTS PRESENTATION

Relevance Feedback and Query Reformulation. Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price. Outline

Fall Lecture 16: Learning-to-rank

Crowdsourcing a News Query Classification Dataset. Richard McCreadie, Craig Macdonald & Iadh Ounis

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Learning to Rank: A New Technology for Text Processing

Predictive Indexing for Fast Search

TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback

Information Retrieval CSCI

Academic Paper Recommendation Based on Heterogeneous Graph

Mobile Friendly Website. Checks whether your website is responsive.

Dynamically Building Facets from Their Search Results

Transcription:

Introduction to Information Retrieval Evaluation

Rank-Based Measures. Binary relevance: Precision@K (P@K), Mean Average Precision (MAP), Mean Reciprocal Rank (MRR). Multiple levels of relevance: Normalized Discounted Cumulative Gain (NDCG).

Precision@K. Set a rank threshold K and compute the % of relevant docs in the top K; documents ranked lower than K are ignored. Ex: Prec@3 of 2/3, Prec@4 of 2/4, Prec@5 of 3/5.
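A minimal Python sketch of Precision@K (not from the original slides); the example ranking [1, 0, 1, 0, 1] is an assumed ordering consistent with the slide's numbers:

```python
def precision_at_k(rels, k):
    """Precision@K: the fraction of relevant documents among the top k.

    rels: binary relevance judgments (1 = relevant, 0 = not) in ranked order.
    """
    if k < 1 or k > len(rels):
        raise ValueError("k must be between 1 and the number of judged documents")
    return sum(rels[:k]) / k

# Assumed ranking matching the slide: relevant, non-relevant, relevant, non-relevant, relevant
rels = [1, 0, 1, 0, 1]
print(precision_at_k(rels, 3))  # 0.666... = 2/3
print(precision_at_k(rels, 4))  # 0.5     = 2/4
print(precision_at_k(rels, 5))  # 0.6     = 3/5
```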

Mean Average Precision. Consider the rank positions K1, K2, ..., KR of each relevant doc and compute Precision@K for each of them. Average precision (AP) = the average of these P@K values. Ex: a ranking with relevant docs at ranks 1, 3, and 5 has AvgPrec = (1/1 + 2/3 + 3/5)/3 ≈ 0.76. MAP is Average Precision averaged across multiple queries/rankings.

Average Precision

MAP

Mean average precision. If a relevant document never gets retrieved, we assume the precision corresponding to that relevant doc to be zero. MAP is macro-averaging: each query counts equally. It is now perhaps the most commonly used measure in research papers. Is it good for web search? MAP assumes the user is interested in finding many relevant documents for each query, and it requires many relevance judgments in the text collection.
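As a sketch (again assuming binary judgments in rank order; this is not code from the slides), average precision and MAP can be computed as below. The example reproduces the 0.76 value for relevant docs at ranks 1, 3, and 5:

```python
def average_precision(rels, total_relevant=None):
    """Average precision of one ranked list.

    rels: binary relevance judgments in ranked order.
    total_relevant: number of relevant docs in the collection; relevant docs
    that are never retrieved contribute zero precision (defaults to the
    number of relevant docs seen in the ranking).
    """
    if total_relevant is None:
        total_relevant = sum(rels)
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # P@K at each relevant doc's rank
    return precision_sum / total_relevant if total_relevant else 0.0

def mean_average_precision(rankings):
    """MAP: macro-average of AP over queries, so each query counts equally."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

print(round(average_precision([1, 0, 1, 0, 1]), 2))  # (1/1 + 2/3 + 3/5) / 3 ≈ 0.76
```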

When There's Only 1 Relevant Document. Scenarios: known-item search, navigational queries, looking for a fact. Search Length = rank of the answer; it measures a user's effort.

Mean Reciprocal Rank. Consider the rank position, K, of the first relevant doc. Reciprocal Rank score = 1/K. MRR is the mean RR across multiple queries.
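A short sketch of RR and MRR under the same assumptions (binary judgments in rank order; the three example queries are invented):

```python
def reciprocal_rank(rels):
    """RR = 1/K, where K is the rank of the first relevant document (0 if none is retrieved)."""
    for rank, rel in enumerate(rels, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(rankings):
    """MRR: mean reciprocal rank across queries."""
    return sum(reciprocal_rank(r) for r in rankings) / len(rankings)

# Hypothetical queries whose first relevant result appears at ranks 1, 3, and 2
queries = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
print(round(mean_reciprocal_rank(queries), 2))  # (1 + 1/3 + 1/2) / 3 ≈ 0.61
```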

Sec. 8.5.1. Critique of pure relevance: relevance vs. marginal relevance. A document can be redundant even if it is highly relevant (duplicates, or the same information from different sources). Marginal relevance is a better measure of utility for the user, but it is harder to create an evaluation set; see Carbonell and Goldstein (1998). Using facts/entities as the evaluation unit can more directly measure true recall. Also related is seeking diversity in first-page results; see the Diversity in Document Retrieval workshops.

(Figure: an example result list with graded relevance labels such as fair, fair, Good.)

Discounted Cumulative Gain. A popular measure for evaluating web search and related tasks. Two assumptions: highly relevant documents are more useful than marginally relevant documents, and the lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined.

Discounted Cumulative Gain. Uses graded relevance as a measure of usefulness, or gain, from examining a document. Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks. The typical discount is 1/log2(rank): with base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3.

Summarize a Ranking: DCG. What if relevance judgments are on a scale of [0, r], r > 2? Cumulative Gain (CG) at rank n: let the ratings of the n documents be r1, r2, ..., rn (in ranked order); then CG = r1 + r2 + ... + rn. Discounted Cumulative Gain (DCG) at rank n: DCG = r1 + r2/log2 2 + r3/log2 3 + ... + rn/log2 n. We may use any base for the logarithm, e.g., base = b.

Discounted Cumulative Gain. DCG is the total gain accumulated at a particular rank p: DCGp = rel1 + sum over i = 2..p of reli/log2 i. An alternative formulation, DCGp = sum over i = 1..p of (2^reli - 1)/log2(i + 1), is used by some web search companies; it puts more emphasis on retrieving highly relevant documents.

DCG Example. 10 ranked documents judged on a 0-3 relevance scale: 3, 2, 3, 0, 0, 1, 2, 2, 3, 0. Discounted gain: 3, 2/1, 3/1.59, 0, 0, 1/2.59, 2/2.81, 2/3, 3/3.17, 0 = 3, 2, 1.89, 0, 0, 0.39, 0.71, 0.67, 0.95, 0. DCG at each rank: 3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61.
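A sketch of DCG using the slides' formulation (no discount at rank 1, then 1/log2(rank)), plus the alternative (2^rel - 1)/log2(rank + 1) form mentioned above; it reproduces the example's DCG of about 6.89 at rank 3 and 9.61 at rank 10:

```python
import math

def dcg(gains):
    """DCG with the slides' discount: rank 1 undiscounted, rank i >= 2 divided by log2(i)."""
    return sum(g if i == 1 else g / math.log2(i) for i, g in enumerate(gains, start=1))

def dcg_exponential(gains):
    """Alternative formulation: sum of (2^rel_i - 1) / log2(i + 1), which boosts highly relevant docs."""
    return sum((2 ** g - 1) / math.log2(i + 1) for i, g in enumerate(gains, start=1))

# The slide's example: 10 documents judged on a 0-3 relevance scale
gains = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
print(round(dcg(gains[:3]), 2))          # 6.89
print(round(dcg(gains), 2))              # 9.61
print(round(dcg_exponential(gains), 2))  # same ranking, different scale
```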

Summarize a Ranking: NDCG. Normalized Discounted Cumulative Gain (NDCG) at rank n: normalize DCG at rank n by the DCG value at rank n of the ideal ranking. The ideal ranking would first return the documents with the highest relevance level, then the next highest relevance level, and so on. (Compare MAP, which computes the precision at the rank of each (new) relevant document, p(1), ..., p(k) for k relevant docs, and averages them.) NDCG is now quite popular in evaluating Web search.

NDCG Example. Four documents d1, d2, d3, d4 with relevance grades r(d4) = 2, r(d3) = 2, r(d2) = 1, r(d1) = 0.

Rank  Ground Truth  Ranking Function 1  Ranking Function 2
1     d4 (r = 2)    d3 (r = 2)          d3 (r = 2)
2     d3 (r = 2)    d4 (r = 2)          d2 (r = 1)
3     d2 (r = 1)    d2 (r = 1)          d4 (r = 2)
4     d1 (r = 0)    d1 (r = 0)          d1 (r = 0)

DCG_GT  = 2 + 2/log2 2 + 1/log2 3 + 0/log2 4 = 4.6309
DCG_RF1 = 2 + 2/log2 2 + 1/log2 3 + 0/log2 4 = 4.6309
DCG_RF2 = 2 + 1/log2 2 + 2/log2 3 + 0/log2 4 = 4.2619
MaxDCG = DCG_GT = 4.6309, so NDCG_GT = 1.00, NDCG_RF1 = 1.00, NDCG_RF2 = 4.2619/4.6309 = 0.9203.
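The example can be checked with a small NDCG sketch (same assumed discount as above; the ideal ranking is obtained by sorting the grades in decreasing order):

```python
import math

def dcg(gains):
    # Rank 1 undiscounted, rank i >= 2 discounted by log2(i), as in the slides.
    return sum(g if i == 1 else g / math.log2(i) for i, g in enumerate(gains, start=1))

def ndcg(gains):
    """NDCG: DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

ground_truth = [2, 2, 1, 0]  # d4, d3, d2, d1
rf1 = [2, 2, 1, 0]           # d3, d4, d2, d1 (same grades in the same order)
rf2 = [2, 1, 2, 0]           # d3, d2, d4, d1
print(round(ndcg(ground_truth), 4), round(ndcg(rf1), 4), round(ndcg(rf2), 4))
# 1.0 1.0 0.9203
```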

Precision-Recall Curve (annotated example from ChengXiang Zhai, Dragon Star Lecture at Beijing University, June 21-30, 2008). Out of 478 relevant docs, we've got 31, so Recall = 31/478. Precision@10 docs: about 5.5 docs in the top 10 are relevant on average. Other annotated points: the Breakeven Point (precision = recall) and Mean Avg. Precision (MAP).
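To make the curve concrete, here is a hedged sketch (not Zhai's code; the judged ranking and the count of 4 relevant docs are invented) that computes the (recall, precision) point after each rank, which is what gets plotted and where the breakeven point is read off:

```python
def precision_recall_points(rels, total_relevant):
    """(recall, precision) after each rank of a judged ranking, for plotting a P-R curve."""
    points, hits = [], 0
    for rank, rel in enumerate(rels, start=1):
        hits += rel
        points.append((hits / total_relevant, hits / rank))
    return points

# Hypothetical ranking of 8 judged documents; 4 relevant docs exist in the collection
for recall, precision in precision_recall_points([1, 0, 1, 1, 0, 0, 1, 0], total_relevant=4):
    print(f"recall={recall:.2f}  precision={precision:.2f}")
# The breakeven point is the point on the curve where precision equals recall.
```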

What Query Averaging Hides (figure: per-query precision-recall curves, precision vs. recall, both on a 0-1 scale). Slide from Doug Oard's presentation, originally from Ellen Voorhees' presentation.