Web Information Retrieval. Exercises: Evaluation in Information Retrieval

Evaluating an IR system. Note: the information need is translated into a query, but relevance is assessed relative to the information need, not the query. E.g., information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine. Query: wine red white heart attack effective. Evaluate whether the document addresses the information need, not whether it contains these query words.

Standard relevance benchmarks. TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years. Reuters and other benchmark document collections are also used. Retrieval tasks are specified, sometimes as queries. Human experts mark, for each query and for each document, Relevant or Irrelevant (or at least for the subset of documents that some system returned for that query).

Unranked results. We next assume that the search engine returns a set of documents as potentially relevant, without performing any ranking. We want to assess the quality of these results.

Precision and Recall. Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved). Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant).

                Relevant   Irrelevant
Retrieved          tp          fp
Not retrieved      fn          tn

Precision P = tp / (tp + fp)
Recall    R = tp / (tp + fn)
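As a quick illustration (not part of the original exercises), here is a minimal Python sketch that computes the two measures from the contingency counts; the function names are mine, chosen for clarity:

```python
def precision(tp, fp):
    """Fraction of retrieved documents that are relevant."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp, fn):
    """Fraction of relevant documents that are retrieved."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

# Example counts: 8 relevant retrieved, 10 nonrelevant retrieved,
# 12 relevant documents missed (the same numbers as the exercise below).
print(precision(8, 10))  # 0.444...
print(recall(8, 12))     # 0.4
```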

A combined measure: F. The combined measure that assesses this tradeoff is the F measure (weighted harmonic mean):

F = 1 / (α·(1/P) + (1 − α)·(1/R)) = (β² + 1)·P·R / (β²·P + R)

People usually use the balanced F1 measure, i.e., with β = 1 (equivalently, α = 1/2).
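A small sketch of the weighted F measure, assuming P and R have already been computed as in the previous snippet (again, the names are illustrative only):

```python
def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision p and recall r."""
    if p == 0 and r == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * p * r / (b2 * p + r)

print(f_measure(0.444, 0.4))  # balanced F1, roughly 0.42
```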

Accuracy. Given a query, an engine classifies each document as Relevant or Irrelevant. The accuracy of the engine is the fraction of these classifications that are correct: (tp + tn) / (tp + fp + fn + tn). Accuracy is a commonly used evaluation measure in machine learning classification work. Q.: Why is this not a very useful evaluation measure in IR?

Why not just use accuracy? How to build a 99.9999% accurate search engine on a low budget: return nothing for every query ("Search for: ... 0 matching results found"). Since almost every document is nonrelevant to any given query, almost every classification is correct. But people doing information retrieval want to find something, and they have a certain tolerance for junk.
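To make the imbalance concrete, a tiny sketch with made-up numbers (a hypothetical collection of 1,000,000 documents of which only 1 is relevant to the query):

```python
# Hypothetical collection: 1,000,000 docs, 1 relevant; the engine returns nothing.
tp, fp, fn, tn = 0, 0, 1, 999_999

accuracy = (tp + tn) / (tp + fp + fn + tn)
recall = tp / (tp + fn)
print(accuracy)  # 0.999999 -- looks excellent
print(recall)    # 0.0      -- the user gets nothing
```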

Precision/Recall Q.: An IR system returns 8 relevant documents, and 10 nonrelevant documents. There are a total of 20 relevant documents in the collection. What is the precision of the system on this search, and what is its recall?

Precision/Recall Q.: An IR system returns 8 relevant documents, and 10 nonrelevant documents. There are a total of 20 relevant documents in the collection. What is the precision of the system on this search, and what is its recall? A.: Applying the definitions: Precision = 8/18 ≈ 0.44; Recall = 8/20 = 0.4.

Precision/Recall Q.: The balanced F measure (a.k.a. F1 ) is defined as the harmonic mean of precision and recall. What is the advantage of using the harmonic mean rather than averaging (using the arithmetic mean)?

Precision/Recall Q.: The balanced F measure (a.k.a. F1) is defined as the harmonic mean of precision and recall. What is the advantage of using the harmonic mean rather than averaging (using the arithmetic mean)? A.: The arithmetic mean is pulled toward the larger of the two values, so it is not a good summary of both; the harmonic mean, on the other hand, is always close to min(Precision, Recall). For example, when all documents are returned, Recall is 1 and the arithmetic mean exceeds 0.5 even though Precision is very low, so the engine looks acceptable while the purpose of the search is not served; the harmonic mean stays low in this case.
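A quick numeric check of that argument, using an assumed extreme case (the whole collection is returned, so Recall = 1 and Precision is tiny):

```python
p, r = 0.0001, 1.0  # assumed: everything returned, almost all of it junk

arithmetic = (p + r) / 2
harmonic = 2 * p * r / (p + r)
print(arithmetic)  # 0.50005  -- misleadingly high
print(harmonic)    # ~0.0002  -- close to min(p, r), as desired
```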

Precision/Recall Q.: Derive the equivalence between the two formulas for the F measure, given that α = 1/(β² + 1).
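No answer is given on the slide; one way to carry out the derivation is to substitute α = 1/(β² + 1), and hence 1 − α = β²/(β² + 1), into the first formula:

F = 1 / (α·(1/P) + (1 − α)·(1/R))
  = 1 / ((1/(β² + 1))·(1/P) + (β²/(β² + 1))·(1/R))
  = (β² + 1) / (1/P + β²/R)
  = (β² + 1)·P·R / (R + β²·P)

which is the second formula, with the denominator terms written in the opposite order.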

Ranked retrieval performance

Interpolated precision. If you can increase precision by increasing recall, then you should get to count that. So you take the maximum of the precision values at, or to the right of, the given recall level.

11-point interpolated average precision. The standard measure in the TREC competitions: you take the interpolated precision at 11 recall levels, from 0 to 1 in steps of 0.1 (the value at recall 0 is always interpolated!), and average them.
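A compact sketch of both ideas, interpolated precision and the 11-point average, for a ranked list of relevance judgements. Here rel is a list of booleans in rank order and num_relevant is the total number of relevant documents in the collection; all names are mine, not from the slides:

```python
def pr_points(rel, num_relevant):
    """(recall, precision) pairs at each rank of the result list."""
    points, tp = [], 0
    for k, is_rel in enumerate(rel, start=1):
        tp += is_rel
        points.append((tp / num_relevant, tp / k))
    return points

def interpolated_precision(points, r):
    """Max precision at any recall level >= r (0 if no level reaches r)."""
    return max((p for rec, p in points if rec >= r), default=0.0)

def eleven_point_ap(rel, num_relevant):
    """Average of interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    points = pr_points(rel, num_relevant)
    levels = [i / 10 for i in range(11)]
    return sum(interpolated_precision(points, r) for r in levels) / 11
```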

Ranked retrieval results Q.: What are the possible values for interpolated precision at a recall level of 0?

Ranked retrieval results Q.: What are the possible values for interpolated precision at a recall level of 0? A.: Any value between 0 and 1.

Ranked retrieval results Q.: Must there always be a break-even point between precision and recall? Either show there must be or give a counter-example.

Ranked retrieval results Q.: Consider a ranked list of results. Must there always be a break-even point between precision and recall? Either show there must be or give a counter-example. A.: Precision = recall iff fp = fn or tp = 0. If the highest ranked element is not relevant, then at rank 1 we have tp = 0, which is a trivial break-even point. If the highest ranked element is relevant: as you go down the list, if an item is R then fn decreases by 1 and fp does not change; if an item is N then fp increases by 1 and fn does not change. In both cases the difference fp − fn grows by exactly one per step. At the beginning of the list fp < fn, and at the end of the full ranking fp > fn (fn has dropped to 0 while fp counts all the nonrelevant documents). Since the difference changes in unit steps from negative to positive, it must pass through zero, so there has to be a break-even point.
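As an illustration of the argument (not a proof), a small scan over a ranked list that reports the first break-even rank; rel and num_relevant are as in the earlier sketch:

```python
def break_even_rank(rel, num_relevant):
    """First rank at which precision equals recall, or None if there is none."""
    tp = 0
    for k, is_rel in enumerate(rel, start=1):
        tp += is_rel
        if tp / k == tp / num_relevant:  # precision == recall
            return k
    return None

# Example ranking over a tiny collection with 3 relevant documents in total
print(break_even_rank([True, False, True, False, True], 3))  # 3
```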

Ranked retrieval results Q.: What is the relationship between the value of F1 and the break-even point?

Ranked retrieval results Q.: What is the relationship between the value of F1 and the break-even point? A.: At the break-even point, F1 = P = R.

Ranked retrieval results Q.: Consider an information need for which there are 4 relevant documents in the collection. Contrast two systems run on this collection. Their top 10 results are judged for relevance as follows (the leftmost item is the top-ranked search result):
System 1: R N R N N N N N R R
System 2: N R N N R R R N N N
a. What is the MAP of each system? Which has a higher MAP?
b. Does this result intuitively make sense? What does it say about what is important in getting a good MAP score?
c. What is the R-precision of each system? (Does it rank the systems the same as MAP?)

Ranked retrieval results A.:
a. MAP(System 1) = (1/4)·(1/1 + 2/3 + 3/9 + 4/10) = 0.6; MAP(System 2) = (1/4)·(1/2 + 2/5 + 3/6 + 4/7) ≈ 0.493. System 1 has the higher MAP.
b. For a good MAP score, it is essential to place relevant documents among the first few (3-5) retrieved ones.
c. R-precision(System 1) = 2/4 = 1/2; R-precision(System 2) = 1/4. This ranks the systems the same as MAP.
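A sketch that reproduces these numbers; average_precision divides by num_relevant = 4 (the total number of relevant documents for the query), and the function names are mine:

```python
def average_precision(rel, num_relevant):
    """Mean of precision values at the ranks of retrieved relevant docs
    (relevant documents that are never retrieved contribute 0)."""
    tp, precisions = 0, []
    for k, is_rel in enumerate(rel, start=1):
        if is_rel:
            tp += 1
            precisions.append(tp / k)
    return sum(precisions) / num_relevant

def r_precision(rel, num_relevant):
    """Precision at rank R, where R = total number of relevant docs."""
    return sum(rel[:num_relevant]) / num_relevant

system1 = [c == "R" for c in "RNRNNNNNRR"]
system2 = [c == "R" for c in "NRNNRRRNNN"]
print(average_precision(system1, 4), r_precision(system1, 4))  # 0.6    0.5
print(average_precision(system2, 4), r_precision(system2, 4))  # ~0.493 0.25
```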

Ranked retrieval results/1 Q.: The following list of R's and N's represents relevant (R) and nonrelevant (N) returned documents in a ranked list of 20 documents retrieved in response to a query from a collection of 10,000 documents. The top of the ranked list (the document the system thinks is most likely to be relevant) is on the left. This list shows 6 relevant documents. Assume that there are 8 relevant documents in total in the collection.
R R N N N N N N R N R N N N R N N N N R
a. What is the precision of the system on the top 20?
b. What is the F1 on the top 20?
c. What is the interpolated precision of the system at 20% recall?
d. What is the interpolated precision of the system at 25% recall?

Ranked retrieval results/2 A.: The answers to a and b follow directly from the definitions (P = 6/20, R = 6/8, and F1 is their harmonic mean). The answer to c is also simple: the first two documents are relevant, so rank 2 already corresponds to 25% recall (2 of the 8 relevant documents) with precision 1; as a consequence, the interpolated precision at 20% recall is 1. For question d, consider the precision values at all recall levels at or above 25% and pick the maximum.
R R N N N N N N R N R N N N R N N N N R
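A quick numeric check of all four parts under the stated counts (6 relevant documents in the top 20, 8 relevant in the collection), written inline so it is self-contained:

```python
rel = [c == "R" for c in "RRNNNNNNRNRNNNRNNNNR"]
num_relevant = 8

tp = sum(rel)
p = tp / len(rel)              # a. precision on the top 20
r = tp / num_relevant
f1 = 2 * p * r / (p + r)       # b. F1 on the top 20

# (recall, precision) at every rank, for the interpolated values
points, seen = [], 0
for k, is_rel in enumerate(rel, start=1):
    seen += is_rel
    points.append((seen / num_relevant, seen / k))

interp_20 = max(prec for rec, prec in points if rec >= 0.20)  # c.
interp_25 = max(prec for rec, prec in points if rec >= 0.25)  # d.
print(p, f1, interp_20, interp_25)  # 0.3  ~0.43  1.0  1.0
```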