
Information Retrieval Lecture 7 - Evaluation in Information Retrieval Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 29

Introduction Framework for the evaluation of an IR system: a test collection consisting of (i) a document collection, (ii) a test suite of information needs and (iii) a set of relevance judgements for each document-query pair. Gold-standard judgement of relevance: classification of a document as either relevant or irrelevant with respect to an information need. NB: as a rule of thumb, the test collection should cover at least 50 information needs. 2/ 29

Overview Standard test collections Evaluation for unranked retrieval Evaluation for ranked retrieval Assessing relevance System quality and user utility Conclusion 3/ 29

Standard test collections Standard test collection Cranfield collection: 1398 abstracts of journal articles about aerodynamics, gathered in the UK in the late 1950s, plus 225 queries and exhaustive relevance judgements. TREC (Text REtrieval Conference): collections maintained by the US National Institute of Standards and Technology (NIST) since 1992. TREC Ad Hoc Track: test collection used for 8 evaluation campaigns run from 1992 to 1999; it contains 1.89 million documents and relevance judgements for 450 topics. TRECs 6-8: test collection providing 150 information needs over about 528,000 newswire articles; the current state-of-the-art test collection. Note that the relevance judgements are not exhaustive. 4/ 29

Standard test collections Standard test collection (continued) GOV2: collection also maintained by NIST, containing 25 million web pages (larger than the other test collections, but still much smaller than the collections indexed by current web search engines). NTCIR (NII Test Collection for IR Systems): various test collections focussing on East Asian languages, mainly used for cross-language IR. CLEF (Cross Language Evaluation Forum): collections focussing on European languages, http://www.clef-campaign.org. REUTERS: Reuters-21578 and Reuters RCV1, containing 21,578 newswire articles and 806,791 documents respectively, mainly used for text classification. 5/ 29

Overview Evaluation for unranked retrieval Standard test collections Evaluation for unranked retrieval Evaluation for ranked retrieval Assessing relevance System quality and user utility Conclusion 6/ 29

Evaluation for unranked retrieval Evaluation for unranked retrieval: basics Two basic effectiveness measures: precision and recall.
Precision = #(relevant retrieved) / #(retrieved)
Recall = #(relevant retrieved) / #(relevant)
In other terms:
                 Relevant            Not relevant
Retrieved        true positive (tp)  false positive (fp)
Not retrieved    false negative (fn) true negative (tn)
Precision = tp / (tp + fp)
Recall = tp / (tp + fn) 7/ 29
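
A minimal sketch (not from the slides) of the set-based measures above; the function name, document IDs and example sets are illustrative only.

def precision_recall(retrieved, relevant):
    """Set-based precision and recall for unranked retrieval."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)                       # relevant documents that were retrieved
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 of the 5 retrieved documents are relevant, out of 4 relevant documents overall.
p, r = precision_recall({1, 2, 3, 4, 5}, {2, 3, 5, 9})
print(p, r)  # 0.6 0.75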

Evaluation for unranked retrieval Evaluation for unranked retrieval: basics (continued) Accuracy: proportion of the relevant/not relevant classifications that are correct: accuracy = (tp + tn) / (tp + fp + fn + tn). Problem: typically 99.9% of the collection is not relevant to a given query, so accuracy is dominated by the true negatives and a system can score very high accuracy while retrieving hardly anything relevant. Recall and precision are inter-dependent measures: precision usually decreases as the number of retrieved documents increases, while recall increases as the number of retrieved documents increases. 8/ 29
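
A quick hypothetical calculation (not in the original slides) makes the skew problem concrete: in a collection of 1,000,000 documents with only 1,000 relevant to the query, a system that retrieves nothing still scores 99.9% accuracy while its recall is 0.

tp, fp, fn, tn = 0, 0, 1_000, 999_000   # hypothetical counts: the system retrieves nothing at all
accuracy = (tp + tn) / (tp + fp + fn + tn)
recall = tp / (tp + fn)
print(accuracy, recall)                 # 0.999 0.0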

Evaluation for unranked retrieval Evaluation for unranked retrieval: F-measure Measure relating precision and recall: F = 1 / (α · (1/Pr) + (1 − α) · (1/Re)). When the coefficient α = 0.5: F = 2 · Pr · Re / (Pr + Re). Uses a harmonic mean rather than an arithmetic one for dealing with extreme values. 9/ 29
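
A small sketch of the weighted F-measure; alpha = 0.5 recovers the balanced F1, and the precision/recall values plugged in below are just the ones from the earlier illustrative example.

def f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return 1.0 / (alpha / precision + (1 - alpha) / recall)

print(f_measure(0.6, 0.75))             # balanced F1 = 2*0.6*0.75 / (0.6 + 0.75) = 0.666...
print(f_measure(0.6, 0.75, alpha=0.8))  # a larger alpha gives precision more weight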

Overview Evaluation for ranked retrieval Standard test collections Evaluation for unranked retrieval Evaluation for ranked retrieval Assessing relevance System quality and user utility Conclusion 10/ 29

Evaluation for ranked retrieval Evaluation for ranked retrieval NB1: precision, recall and F-measure are set-based measures (the order of the retrieved documents is not taken into account). NB2: if we consider the first k retrieved documents, we can compute precision and recall at each value of k and plot the relation between them: if the (k+1)-st document is not relevant, recall stays the same but precision decreases; if the (k+1)-st document is relevant, both recall and precision increase. 11/ 29
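
The rank-by-rank behaviour described above can be traced with a short sketch (the document IDs and example ranking are made up): it returns the (recall, precision) point after each rank k.

def precision_recall_points(ranking, relevant):
    """(recall, precision) after each rank k of a ranked result list.
    Assumes at least one relevant document exists for the query."""
    relevant = set(relevant)
    points, hits = [], 0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    return points

# Relevant documents retrieved at ranks 1, 3 and 5:
for rec, prec in precision_recall_points(["d1", "d7", "d3", "d9", "d5"], {"d1", "d3", "d5"}):
    print(f"recall={rec:.2f}  precision={prec:.2f}")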

Evaluation for ranked retrieval Evaluation for ranked retrieval (continued) Precision-recall curve (figure). Interpolation of the precision (smoothing): P_interp(r) = max_{r' ≥ r} P(r'). 12/ 29
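
A minimal sketch of this smoothing step, reusing the (recall, precision) points produced by the earlier precision_recall_points sketch: the interpolated precision at recall level r is the maximum precision observed at any recall level r' >= r.

def interpolated_precision(points, r):
    """p_interp(r): maximum precision over all points with recall >= r (0.0 if none)."""
    return max((prec for recall, prec in points if recall >= r), default=0.0)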

Evaluation for ranked retrieval Ranked retrieval: effectiveness measures 11-point interpolated average precision: for each information need, P_interp is measured at the 11 recall levels 0.0, 0.1, 0.2, ..., 1.0; the arithmetic mean of P_interp at each recall level over all information needs is then computed. [Figure: average 11-point precision/recall graph, precision vs. recall, from Manning et al., 2008] 13/ 29
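
Building on the two sketches above (precision_recall_points and interpolated_precision are assumed to be in scope), 11-point interpolated average precision over a set of information needs could be computed as follows; the per-query rankings and relevance sets are hypothetical inputs.

def eleven_point_average(runs):
    """runs: list of (ranking, relevant_set) pairs, one per information need.
    Returns (recall level, mean interpolated precision) for levels 0.0, 0.1, ..., 1.0."""
    levels = [i / 10 for i in range(11)]
    results = []
    for r in levels:
        per_query = [interpolated_precision(precision_recall_points(ranking, relevant), r)
                     for ranking, relevant in runs]
        results.append((r, sum(per_query) / len(per_query)))
    return results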

Evaluation for ranked retrieval Ranked retrieval: effectiveness measures (continued) Mean Average Precision (MAP): for a single information need, the average precision is the arithmetic mean of the precision values obtained for the set of top-ranked documents after each relevant document is retrieved. With q_j ∈ Q an information need, {d_1, ..., d_m_j} the set of relevant documents for q_j, and R_jk the set of ranked retrieved documents from the top result down to document d_k: MAP(Q) = (1/|Q|) Σ_{j=1..|Q|} (1/m_j) Σ_{k=1..m_j} Precision(R_jk). NB: when a relevant document d_l (1 ≤ l ≤ m_j) is not retrieved at all, Precision(R_jl) is taken to be 0. 14/ 29
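
A sketch of MAP under the same illustrative conventions (a list of (ranking, relevant_set) pairs, one per information need); relevant documents that are never retrieved contribute nothing to the inner sum, which matches the Precision = 0 convention above.

def mean_average_precision(runs):
    """runs: list of (ranking, relevant_set) pairs, one per information need.
    Assumes every information need has at least one relevant document."""
    ap_values = []
    for ranking, relevant in runs:
        relevant = set(relevant)
        hits, precisions = 0, []
        for k, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                precisions.append(hits / k)        # precision just after each relevant document
        ap_values.append(sum(precisions) / len(relevant))
    return sum(ap_values) / len(ap_values)         # arithmetic mean over the |Q| information needs

# One query, relevant documents at ranks 1 and 3: AP = (1/1 + 2/3) / 2
print(mean_average_precision([(["d1", "d4", "d2"], {"d1", "d2"})]))  # 0.8333...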

Evaluation for ranked retrieval Ranked retrieval: effectiveness measures (continued) Precision at k: for web search engines, we are interested in the proportion of good results among the first k answers (say, the first 3 result pages), i.e. precision at a fixed cut-off level. Pros: does not require an estimate of the size of the set of relevant documents. Cons: unstable measure, does not average well. 15/ 29
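
Precision at k is straightforward to sketch; the default cut-off k = 10 below (roughly one result page) is an illustrative choice, not something prescribed by the slides.

def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k results that are relevant (missing results count as non-relevant)."""
    relevant = set(relevant)
    return sum(1 for doc in ranking[:k] if doc in relevant) / k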

Evaluation for ranked retrieval Ranked retrieval: effectiveness measures (continued) R-precision: an alternative to precision at k. Given the set of Rel known relevant documents for a query, R-precision is the precision obtained after examining the first Rel answers: Pr = r / Rel and Re = r / Rel, where r is the number of relevant documents among those first Rel results. Since precision equals recall at this point, it is also called the break-even point of the precision/recall curve. 16/ 29
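
R-precision is almost the same computation, except that the cut-off is the number Rel of known relevant documents for the query; a sketch with made-up document IDs:

def r_precision(ranking, relevant):
    """Precision of the top-Rel results, where Rel is the number of known relevant documents."""
    relevant = set(relevant)
    rel = len(relevant)
    return sum(1 for doc in ranking[:rel] if doc in relevant) / rel

# 2 relevant documents, 1 of them in the top 2 results -> R-precision = recall = 0.5
print(r_precision(["d1", "d4", "d2"], {"d1", "d2"}))  # 0.5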

Evaluation for ranked retrieval Ranked retrieval: effectiveness measures (continued) Normalized Discounted Cumulative Gain (NDCG): evaluation made over the top k results: NDCG(Q, k) = (1/|Q|) Σ_{j=1..|Q|} Z_kj Σ_{m=1..k} (2^R(j,m) − 1) / log2(1 + m), where R(j, m) is the graded relevance score given by the assessors to the document at rank m for query j (NB: non-binary notion of relevance), and Z_kj is a normalization factor chosen so that a perfect ranking gets an NDCG of 1 at k. 17/ 29
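
A sketch of NDCG at k for a single query (the formula above additionally averages over the queries in Q); grades is a hypothetical dict mapping document IDs to the assessors' graded scores, unjudged documents are treated as grade 0, and the normalization factor comes from sorting the known grades into a perfect ranking.

import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain of a list of relevance grades, ordered by rank."""
    return sum((2 ** g - 1) / math.log2(1 + m)
               for m, g in enumerate(gains[:k], start=1))

def ndcg_at_k(ranking, grades, k):
    """grades: dict doc_id -> graded relevance score; ranking: retrieved doc IDs in rank order."""
    gains = [grades.get(doc, 0) for doc in ranking]
    ideal_dcg = dcg_at_k(sorted(grades.values(), reverse=True), k)   # Z_k from the perfect ranking
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical graded judgements on a 0-3 scale:
print(ndcg_at_k(["d2", "d1", "d5"], {"d1": 3, "d2": 2, "d5": 1}, k=3))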

Overview Assessing relevance Standard test collections Evaluation for unranked retrieval Evaluation for ranked retrieval Assessing relevance System quality and user utility Conclusion 18/ 29

Assessing relevance Assessing relevance How good is an IR system at satisfying an information need? Answering this requires agreement between judges, which can be measured with the kappa statistic: kappa = (P(A) − P(E)) / (1 − P(E)), where P(A) is the proportion of observed agreement between the judgements and P(E) is the proportion of agreement expected by chance. In the case of binary decisions (e.g. relevant vs. not relevant) with judgements assumed uniformly random, P(E) = 0.5; otherwise P(E) is estimated from the marginal statistics (see the example below). 19/ 29

Assessing relevance Assessing relevance: an example Consider the following judgements (from Manning et al., 2008):
                  Judge 2: Yes   Judge 2: No   Total
Judge 1: Yes      300            20            320
Judge 1: No       10             70            80
Total             310            90            400
P(A) = 370 / 400 = 0.925
P(rel) = (320 + 310) / 800 = 0.7875
P(notrel) = (80 + 90) / 800 = 0.2125
P(E) = P(rel)^2 + P(notrel)^2 ≈ 0.665
kappa = (0.925 − 0.665) / (1 − 0.665) ≈ 0.776 20/ 29
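
The worked example can be checked with a few lines; the function below takes the four cells of the 2x2 contingency table and estimates P(E) from the pooled marginals, as on the slide.

def kappa_binary(yy, yn, ny, nn):
    """Kappa for two judges and binary relevance judgements.
    Arguments are the contingency-table counts: yes/yes, yes/no, no/yes, no/no."""
    n = yy + yn + ny + nn
    p_agree = (yy + nn) / n                        # P(A): observed agreement
    p_rel = ((yy + yn) + (yy + ny)) / (2 * n)      # pooled marginal P(relevant)
    p_nonrel = ((ny + nn) + (yn + nn)) / (2 * n)   # pooled marginal P(not relevant)
    p_expected = p_rel ** 2 + p_nonrel ** 2        # P(E): agreement expected by chance
    return (p_agree - p_expected) / (1 - p_expected)

print(round(kappa_binary(300, 20, 10, 70), 3))     # 0.776, as in the example above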

Assessing relevance Assessing relevance (continued) Interpretation of the kappa statistic k: k ≥ 0.8 indicates good agreement, 0.67 ≤ k < 0.8 fair agreement, and k < 0.67 bad agreement. Note that the kappa statistic can be negative if the agreement between judgements is worse than random. In case of large variations between judgements, one can choose one assessor as the gold standard; this has a considerable impact on the absolute assessment but little impact on the relative assessment of systems. 21/ 29

Assessing relevance Assessing relevance: limits Limits of the assessment: does the relevance of a document change with context, i.e. when the document appears after some other document? For web search engines, duplicates are a typical case. How can marginal relevance be evaluated? By evaluating diversity or novelty. 22/ 29

Overview System quality and user utility Standard test collections Evaluation for unranked retrieval Evaluation for ranked retrieval Assessing relevance System quality and user utility Conclusion 23/ 29

System quality and user utility System quality and user utility Ultimate interest: how satisfied is the user with the results the system returns for each of their information needs? Evaluation criteria for an IR system: fast indexing, fast searching, expressivity of the query language, size of the collection supported, and user interface (clarity of the input form and of the output list, e.g. snippets, etc.). 24/ 29

System quality and user utility System quality and user utility (continued) Quantifying user happiness? For web search engines: do the users find the information they are looking for? This can be quantified by measuring the proportion of users who come back to the engine. For intranet search engines: efficiency can be evaluated by the time spent searching for a given piece of information. General case: user studies evaluating how well the search engine matches its expected usage (e-commerce, etc.). 25/ 29

System quality and user utility Refining a deployed system A/B test: a standard test for evaluating changes, particularly used for web search engines. It concerns the modification of a single functionality between two variants A and B of a system: about 10% of the users are blindly redirected to the new variant B, and the results of the two systems A and B are then compared. Example: for a change to the ranking algorithm, compare result clicking between users of system A and users of system B. 26/ 29
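
The slides do not prescribe how to analyse the clickthrough data, but as a rough illustration the comparison between the A and B populations could be summarised with a two-proportion z-test; the counts below are made up.

import math

def two_proportion_z(clicks_a, users_a, clicks_b, users_b):
    """z-statistic for the difference in clickthrough rate between variants A and B."""
    p_a, p_b = clicks_a / users_a, clicks_b / users_b
    p_pool = (clicks_a + clicks_b) / (users_a + users_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / users_a + 1 / users_b))
    return (p_b - p_a) / se

# Hypothetical traffic split: 90% of users stay on A, 10% are redirected to the new variant B.
print(two_proportion_z(clicks_a=4_500, users_a=90_000, clicks_b=560, users_b=10_000))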

Overview Conclusion Standard test collections Evaluation for unranked retrieval Evaluation for ranked retrieval Assessing relevance System quality and user utility Conclusion 27/ 29

Conclusion Conclusion Standard test collections: Cranfield, TREC, CLEF, etc. Evaluation for unranked retrieval: precision, recall and F-measure. Evaluation for ranked retrieval: 11-point interpolated average precision, Mean Average Precision, R-precision and Normalized Discounted Cumulative Gain. Assessing relevance: the kappa statistic. Discussion of system quality, user utility and system refinement. 28/ 29

Conclusion References
C. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Chapter 8: Evaluation in information retrieval. http://nlp.stanford.edu/ir-book/pdf/chapter08-evaluation.pdf
C. Cleverdon, The significance of the Cranfield tests on index languages (1991). http://elvis.slis.indiana.edu/irpub/sigir/1991/pdf1.pdf
C. Buckley and E. Voorhees, Evaluating evaluation measure stability (2000). http://citeseer.ist.psu.edu/buckley00evaluating.html
C. Buckley, the trec_eval software. http://trec.nist.gov/trec_eval/ 29/ 29