
Evaluating search engines CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

Sec. 8.6 Evaluation of a search engine How fast does it index? Number of documents/hour Incremental indexing How large is its doc collection? How fast does it search? How expressive is the query language? User interface design issues This is all good, but it says nothing about the quality of its search 2

Sec. 8.1 User happiness is elusive to measure The key utility measure is user happiness: how satisfied is each user with the obtained results? The most common proxy to measure human satisfaction is the relevance of search results to the posed information need. How do you measure relevance? 3

Why do we need system evaluation? How do we know which of the already introduced techniques are effective in which applications? Should we use stop lists? Should we stem? Should we use inverse document frequency weighting? How can we claim to have built a better search engine for a document collection? 4

Measuring relevance Relevance measurement requires 3 elements: 1. A benchmark doc collection 2. A benchmark suite of information needs 3. A (usually binary) assessment of either Relevant or Nonrelevant for each information need and each document Some work uses more-than-binary assessments, but that is not the standard 5

So you want to measure the quality of a new search algorithm You need: benchmark documents, a benchmark query suite, and judgments of document relevance for each query [Diagram: docs + sample queries -> relevance judgements] 6

Relevance judgments Binary (relevant vs. non-relevant) in the simplest case, more nuanced (0, 1, 2, 3, ...) in others What are some issues already? The cost of getting these relevance judgments 7

Crowd-source relevance judgments? Present query-document pairs to low-cost labor on online crowd-sourcing platforms, hoping that this is cheaper than hiring qualified assessors. There is lots of literature on using crowd-sourcing for such tasks. Main takeaway: you get some signal, but the variance in the resulting judgments is very high. 8

Sec. 8.1 Evaluating an IR system Note: user need is translated into a query Relevance is assessed relative to the user need, not the query E.g., Information need: My swimming pool bottom is becoming black and needs to be cleaned. Query: pool cleaner Assess whether the doc addresses the underlying need, not whether it has these words 9

Sec. 8.5 What else? We still need test queries. They must be germane to the docs available and representative of actual user needs. Random query terms from the documents are generally not a good idea. Sample from query logs if available. Classically (non-web): low query rates meant there were not enough query logs, so experts hand-crafted user needs. 10

Sec. 8.5 Some public test collections [Table of standard public test collections; typical example: TREC] 11

Unranked retrieval evaluation: Precision and Recall Sec. 8.3 Precision: P(relevant | retrieved), the fraction of retrieved docs that are relevant. Recall: P(retrieved | relevant), the fraction of relevant docs that are retrieved.

                 Relevant    Nonrelevant
  Retrieved        tp            fp
  Not Retrieved    fn            tn

Precision P = tp/(tp + fp)    Recall R = tp/(tp + fn) 13
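As a concrete illustration (not from the slides), a minimal Python sketch that computes set-based precision and recall from binary judgments; the doc-id sets are made up for the example:

```python
# Illustrative sketch: set-based precision and recall from binary judgments.

def precision_recall(retrieved, relevant):
    """retrieved, relevant: sets of doc ids."""
    tp = len(retrieved & relevant)   # relevant docs that were retrieved
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Example: 3 of the 5 retrieved docs are relevant; 6 docs are relevant in total.
p, r = precision_recall({1, 2, 3, 4, 5}, {2, 4, 5, 8, 9, 10})
print(p, r)   # 0.6 0.5
```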

Accuracy measure for evaluation? Accuracy: the fraction of classifications that are correct; this is a standard evaluation measure in machine learning classification. The accuracy of an engine: (tp + tn) / (tp + fp + fn + tn). Given a query, an engine classifies each doc as Relevant or Nonrelevant. Why is this not a very useful evaluation measure in IR? 14

Sec. 8.3 Why not just use accuracy? How to build a 99.9999% accurate search engine on a low budget: the snoogle search engine below always returns 0 results ("No matching results found"), regardless of the query. Since there are many more non-relevant docs than relevant ones, it is almost always "correct". [Mock screenshot: Search for: ... 0 matching results found.] People want to find something and have a certain tolerance for junk. 15

Sec. 8.3 Precision/Recall Retrieving all docs for all queries! High recall but low precision Recall is a non-decreasing function of the number of docs retrieved In a good system, precision decreases as either the number of docs retrieved or recall increases This is not a theorem, but a result with strong empirical confirmation 16

Sec. 8.3 A combined measure: F The F measure allows us to trade off precision against recall. It is the weighted harmonic mean of P and R: F = 1 / (α·(1/P) + (1 − α)·(1/R)) = (β² + 1)·P·R / (β²·P + R), where β² = (1 − α)/α. What value range of β weights recall higher than precision? 17

A combined measure: F People usually use the balanced F (β = 1, i.e., α = ½): F1 = 2PR / (P + R), the harmonic mean of P and R: 1/F1 = ½·(1/P + 1/R) 18

Why harmonic mean Why don't we use a different mean of P and R as a measure? e.g., the arithmetic mean. The simple (arithmetic) mean is 50% for a return-everything search engine, which is too high. Desideratum: punish really bad performance on either precision or recall. Taking the minimum achieves this, but the minimum is not smooth and is hard to weight. F (harmonic mean) is a kind of smooth minimum. 19
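A minimal sketch, with made-up numbers, showing F1 and the general Fβ and why the harmonic mean behaves like a soft minimum where the arithmetic mean does not:

```python
# Illustrative sketch: balanced F1 and the general F_beta, plus a comparison
# with the arithmetic mean for the "return everything" engine.

def f_beta(p, r, beta=1.0):
    if p == 0 and r == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * p * r / (b2 * p + r)

# "Return everything" engine on a collection where 0.01% of docs are relevant:
p, r = 0.0001, 1.0
print((p + r) / 2)               # arithmetic mean ~0.5: far too generous
print(f_beta(p, r))              # F1 ~0.0002: close to min(P, R), as desired
print(f_beta(0.6, 0.5, beta=2))  # beta > 1 weights recall more heavily
```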

Sec. 8.3 F1 and other averages [Figure: combined measure vs. precision, with recall fixed at 70%, comparing the minimum, maximum, arithmetic, geometric, and harmonic means.] Harmonic mean is a conservative average; we can view the harmonic mean as a kind of soft minimum. 20

Sec. 8.4 Evaluating ranked results Precision, recall and F are measures for (unranked) sets. We can easily turn set measures into measures of ranked lists: take various numbers of top returned docs (recall levels); the sets of retrieved docs are given by the top k retrieved docs. Just compute the set measure for each prefix: the top 1, top 2, top 3, top 4, and so on. Doing this for precision and recall gives you a precision-recall curve. 21

Rank-Based Measures Binary relevance Precision-Recall curve Precision@K (P@K) Mean Average Precision (MAP) Mean Reciprocal Rank (MRR) Multiple levels of relevance Normalized Discounted Cumulative Gain (NDCG) 22

Sec. 8.4 A precision-recall curve [Figure: precision (y-axis, 0.0-1.0) plotted against recall (x-axis, 0.0-1.0) for one query.] 23

An interpolated precision-recall curve Interpolated precision: p_interp(r) = max_{r' ≥ r} p(r') [Figure: the interpolated precision-recall curve, precision vs. recall.] 25

Sec. 8.4 Averaging over queries A precision-recall graph for one query isn't a very sensible thing to look at; we average performance over a whole bunch of queries. But there's a technical issue: precision-recall calculations only place some points on the graph. How do you determine a value (interpolate) between the points? 26

Binary relevance evaluation Graphs are good, but people want summary measures! 11-point interpolated average precision Precision at fixed retrieval level MAP R-precision 27

Sec. 8.4 11-point interpolated average precision The standard measure in the early TREC competitions: take the interpolated precision at 11 levels of recall varying from 0 to 1 by tenths (0, 0.1, 0.2, ..., 1) and average them. Evaluates performance at all recall levels. 28
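A minimal sketch of 11-point interpolated average precision for a single ranked list, assuming binary relevance labels in rank order (the example labels are made up):

```python
# Illustrative sketch: interpolated precision p_interp(r) = max_{r' >= r} p(r'),
# averaged over the 11 recall levels 0.0, 0.1, ..., 1.0.

def eleven_point_interpolated_ap(ranked_rel, n_relevant):
    # (recall, precision) after each rank prefix
    points, hits = [], 0
    for k, rel in enumerate(ranked_rel, start=1):
        hits += rel
        points.append((hits / n_relevant, hits / k))

    def p_interp(r):
        candidates = [p for (rec, p) in points if rec >= r]
        return max(candidates) if candidates else 0.0

    levels = [i / 10 for i in range(11)]
    return sum(p_interp(r) for r in levels) / 11

# Relevant docs at ranks 1, 3, 5 out of 3 relevant docs in total:
print(eleven_point_interpolated_ap([1, 0, 1, 0, 1], n_relevant=3))
```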

Sec. 8.4 Typical (good) 11-point precisions [Figure: SabIR/Cornell 8A1 11-point precision-recall curve from TREC 8 (1999), precision vs. recall.] 29

Precision-at-k Precision-at-k: precision of the top k results. Set a rank threshold K and ignore documents ranked lower than K. Perhaps appropriate for most web searches: people want good matches on the first one or two results pages. Does not need any estimate of the size of the relevant set. But: it averages badly and has an arbitrary parameter k. 30

Precision-at-k Compute % relevant in top K Examples Prec@3 of 2/3 Prec@4 of 2/4 Prec@5 of 3/5 In similar fashion we have Recall@K 31
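A small sketch of Precision@K and Recall@K, using a made-up ranking whose relevant docs sit at ranks 1, 3, and 5 (consistent with the Prec@3 = 2/3, Prec@4 = 2/4, Prec@5 = 3/5 examples above):

```python
# Illustrative sketch: precision and recall at a rank cutoff K.
# ranked_rel holds binary relevance labels in rank order.

def precision_at_k(ranked_rel, k):
    return sum(ranked_rel[:k]) / k

def recall_at_k(ranked_rel, k, n_relevant):
    return sum(ranked_rel[:k]) / n_relevant

ranked_rel = [1, 0, 1, 0, 1]             # relevant docs at ranks 1, 3, 5
print(precision_at_k(ranked_rel, 3))     # 2/3
print(precision_at_k(ranked_rel, 5))     # 3/5
print(recall_at_k(ranked_rel, 3, n_relevant=3))   # 2/3
```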

Average precision Consider the rank position of each relevant doc: K1, K2, ... KR. Compute Precision@K for each of K1, K2, ... KR. Average precision = the average of these P@K values. Ex: a ranking with relevant docs at positions 1, 3, and 5 has AvgPrec = (1/1 + 2/3 + 3/5) / 3 ≈ 0.76. 32
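The same example as a short sketch: average precision over the ranks of the relevant docs, reproducing the ≈0.76 value above:

```python
# Illustrative sketch: average precision for a single ranked list.

def average_precision(ranked_rel, n_relevant=None):
    if n_relevant is None:
        n_relevant = sum(ranked_rel)     # assume all relevant docs are ranked
    hits, total = 0, 0.0
    for k, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            total += hits / k            # Precision@K at each relevant rank K
    return total / n_relevant if n_relevant else 0.0

print(round(average_precision([1, 0, 1, 0, 1]), 2))   # 0.76
```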

Sec. 8.4 Mean Average Precision (MAP) MAP is average precision averaged across multiple queries/rankings. For a single query, average precision is obtained by computing precision at the cut-off of each relevant doc retrieved and averaging those values; MAP for a query collection is the arithmetic mean of these per-query values. Macro-averaging: each query counts equally. 33

Average precision: example 34

MAP: example 35

MAP Q: set of information needs. {d_{j,1}, d_{j,2}, ..., d_{j,m_j}}: set of relevant docs for q_j in Q. R_{jk}: set of ranked retrieval results from the top result until you get to doc d_{j,k}. MAP(Q) = (1/|Q|) · Σ_{j=1}^{|Q|} (1/m_j) · Σ_{k=1}^{m_j} Precision(R_{jk}) 36
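A minimal sketch of MAP as the macro-average of per-query average precision, assuming (for simplicity) that every relevant doc appears somewhere in the ranking; the query labels are made up:

```python
# Illustrative sketch: MAP = macro-average of per-query average precision.

def average_precision(ranked_rel):
    hits, total = 0, 0.0
    for k, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0   # assumes all relevant docs are ranked

def mean_average_precision(rankings):
    return sum(average_precision(r) for r in rankings) / len(rankings)

queries = [[1, 0, 1, 0, 1],    # AP ~0.76
           [0, 1, 0, 0, 1]]    # AP = (1/2 + 2/5) / 2 = 0.45
print(mean_average_precision(queries))   # ~0.60
```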

MAP Now perhaps most commonly used measure in research papers Good for web search? MAP assumes user is interested in finding many relevant docs for each query MAP requires many relevance judgments in text collection 37

Beyond binary relevance [Figure: example search results labeled with graded relevance levels such as Fair and Good.] 38

Discounted Cumulative Gain Popular measure for evaluating web search and related tasks. Two assumptions: highly relevant docs are more useful than marginally relevant docs; and the lower the ranked position of a relevant doc, the less useful it is to the user. 39

Discounted Cumulative Gain Uses graded relevance as a measure of usefulness: more than two levels (i.e., not just relevant vs. non-relevant). Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks. The typical discount is 1/log(rank): with base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3. 40

Summarize a Ranking: DCG Cumulative Gain (CG) at rank n: let the ratings of the n docs be r_1, r_2, ..., r_n (in ranked order); CG = r_1 + r_2 + ... + r_n. Discounted Cumulative Gain (DCG) at rank n: DCG = r_1 + r_2/log2(2) + r_3/log2(3) + ... + r_n/log2(n). We may use any base for the logarithm. 41
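A short sketch of CG and DCG with the discount above (no discount at rank 1, then 1/log2(rank)); the ratings reuse the first seven grades from the DCG example below:

```python
# Illustrative sketch: cumulative gain and discounted cumulative gain at rank n.
import math

def dcg(ratings):
    return sum(r if i == 1 else r / math.log2(i)
               for i, r in enumerate(ratings, start=1))

ratings = [3, 2, 3, 0, 0, 1, 2]   # graded relevance in rank order
print(sum(ratings))               # CG = 11
print(round(dcg(ratings), 2))     # DCG ~= 7.99, matching the example below
```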

Discounted Cumulative Gain DCG is the total gain accumulated at a particular rank p: DCG_p = rel_1 + Σ_{i=2}^{p} rel_i / log2(i). An alternative formulation, used by some web search companies, puts more emphasis on retrieving highly relevant documents: DCG_p = Σ_{i=1}^{p} (2^{rel_i} − 1) / log2(1 + i). 42

DCG Example 10 ranked documents judged on a 0-3 relevance scale (first seven values shown): relevance: 3, 2, 3, 0, 0, 1, 2; discounted gain: 3, 2/1, 3/1.59, 0, 0, 1/2.59, 2/2.81 = 3, 2, 1.89, 0, 0, 0.39, 0.71; cumulative DCG: 3, 5, 6.89, 6.89, 6.89, 7.28, 7.99 43

Summarize a Ranking: NDCG NDCG(q,k) is computed over the k top search results (similar to p@k) NDCG normalizes DCG at rank k by the DCG value at rank k of the ideal ranking Ideal ranking: first returns docs with the highest relevance level, then the next highest relevance level, etc Normalization useful for contrasting queries with varying numbers of relevant results NDCG is now quite popular in evaluating Web search 44

NDCG Example Four documents d1, d2, d3, d4, each with graded relevance r_i.

  Rank i | Ground Truth      | Ranking Function 1 | Ranking Function 2
         | doc   r_i         | doc   r_i          | doc   r_i
  1      | d4    2           | d3    2            | d3    2
  2      | d3    2           | d4    2            | d2    1
  3      | d2    1           | d2    1            | d4    2
  4      | d1    0           | d1    0            | d1    0

DCG_GT  = 2 + 2/log2(2) + 1/log2(3) + 0/log2(4) = 4.6309
DCG_RF1 = 2 + 2/log2(2) + 1/log2(3) + 0/log2(4) = 4.6309
DCG_RF2 = 2 + 1/log2(2) + 2/log2(3) + 0/log2(4) = 4.2619
MaxDCG = DCG_GT = 4.6309
NDCG_GT = 1.00, NDCG_RF1 = 4.6309/4.6309 = 1.00, NDCG_RF2 = 4.2619/4.6309 = 0.9203 45
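A short sketch that reproduces the NDCG numbers above by normalizing DCG with the DCG of the ideal (grade-sorted) ranking:

```python
# Illustrative sketch: NDCG = DCG of the ranking / DCG of the ideal ranking.
import math

def dcg(ratings):
    return sum(r if i == 1 else r / math.log2(i)
               for i, r in enumerate(ratings, start=1))

def ndcg(ratings):
    ideal = sorted(ratings, reverse=True)   # ideal ranking: highest grades first
    return dcg(ratings) / dcg(ideal)

print(round(ndcg([2, 2, 1, 0]), 4))   # ranking function 1 -> 1.0
print(round(ndcg([2, 1, 2, 0]), 4))   # ranking function 2 -> 0.9203
```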

What if the results are not in a list? Suppose there's only one relevant document. Scenarios: known-item search, navigational queries, looking for a fact. Search duration ~ rank of the answer: measures a user's effort. 46

Mean Reciprocal Rank Consider the rank position, K, of the first relevant doc (it could be the only clicked doc). Reciprocal Rank score = 1/K. MRR is the mean RR across multiple queries. 47
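A minimal sketch of reciprocal rank and MRR; the two example rankings are made up:

```python
# Illustrative sketch: MRR = mean of 1/rank of the first relevant result.

def reciprocal_rank(ranked_rel):
    for k, rel in enumerate(ranked_rel, start=1):
        if rel:
            return 1.0 / k
    return 0.0   # no relevant result found

def mrr(rankings):
    return sum(reciprocal_rank(r) for r in rankings) / len(rankings)

# First relevant doc at rank 1 for query 1 and rank 3 for query 2:
print(mrr([[1, 0, 0], [0, 0, 1]]))   # (1 + 1/3) / 2 ~= 0.67
```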

Human judgments are: expensive; inconsistent (between raters, and over time); decaying in value as the documents/query mix evolves; and not always representative of real users (rating is vis-à-vis the query, vs. the underlying need). So what alternatives do we have? 49

Using user clicks 50

What do clicks tell us? [Figure: # of clicks received by result position.] Strong position bias, so absolute click rates are unreliable. 51

Relative vs. absolute ratings [Figure: a results page annotated with the user's click sequence.] Hard to conclude Result1 > Result3; probably can conclude Result3 > Result2. 52

Pairwise relative ratings Pairs of the form: DocA better than DocB for a query. This doesn't mean that DocA is relevant to the query. Now, rather than assess a rank-ordering wrt per-doc relevance assessments, assess it in terms of conformance with historical pairwise preferences recorded from user clicks. 53

Sec. 8.6.3 A/B testing: refining a deployed system Purpose: test a single innovation. Prerequisite: you have a large search engine up and running. Method: divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation, so most users use the old system. 54

Sec. 8.6.3 A/B testing at web search engines Have most users use old system Divert a small proportion of traffic (e.g., 1%) to an experiment to evaluate an innovation Full page experiment Interleaved experiment 55

Comparing two rankings via clicks (Joachims 2002) Query: [support vector machines] Ranking A: Kernel machines, SVM-light, Lucent SVM demo, Royal Holl. SVM, SVM software, SVM tutorial. Ranking B: Kernel machines, SVMs, Intro to SVMs, Archives of SVM, SVM-light, SVM software. 56

Interleave the two rankings (this interleaving starts with B): Kernel machines (B), Kernel machines (A), SVMs (B), SVM-light (A), Intro to SVMs (B), Lucent SVM demo (A), Archives of SVM (B), Royal Holl. SVM (A), SVM-light (B), ... 57

Remove duplicate results: results appearing in both rankings are shown only once in the interleaved list (e.g., the second copy of Kernel machines and the later duplicate of SVM-light are dropped). 58

Count user clicks Each clicked result is credited to the ranking(s) it came from: here one click lands on a result present in both A and B, and two more on results from A only, giving Ranking A: 3 clicks, Ranking B: 1 click. 59

Interleaved ranking Present the interleaved ranking to users. Start randomly with ranking A or ranking B to even out presentation bias. Count clicks on results from A versus results from B. The better ranking will (on average) get more clicks. 60
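A simplified sketch of the interleaving experiment described above: alternate results from A and B (choosing the starting ranking at random), drop duplicates, and credit each click to the ranking(s) that contributed the clicked result. This is a simplification of Joachims' balanced interleaving; the function names and doc identifiers are illustrative:

```python
# Simplified interleaving sketch: alternate A and B, drop duplicates,
# then credit clicks back to the source ranking(s).
import random

def interleave(a, b):
    first, second = (a, b) if random.random() < 0.5 else (b, a)
    merged, seen = [], set()
    for pair in zip(first, second):      # assumes rankings of equal length
        for doc in pair:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

def credit_clicks(clicked_docs, a, b):
    a_clicks = sum(1 for d in clicked_docs if d in a)
    b_clicks = sum(1 for d in clicked_docs if d in b)
    return a_clicks, b_clicks            # the better ranking should win on average

ranking_a = ["kernel-machines", "svm-light", "lucent-svm-demo", "royal-holl-svm"]
ranking_b = ["kernel-machines", "svms", "intro-to-svms", "archives-of-svm"]
shown = interleave(ranking_a, ranking_b)
a_score, b_score = credit_clicks(["kernel-machines", "svm-light"], ranking_a, ranking_b)
print(shown, a_score, b_score)           # a click on a shared result credits both
```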

Facts/entities (what happens to clicks?) 61

Comparing two rankings to a baseline ranking Given a set of pairwise preferences P We want to measure two rankings A and B Define a proximity measure between A and P (and likewise, between B and P) Proximity measure should reward agreements with P and penalize disagreements Want to declare the ranking with better proximity to be the winner 62

Kendall tau distance X: # of agreements between a ranking (say A) and P; Y: # of disagreements. Then the Kendall tau distance between A and P is (X − Y) / (X + Y). Example: P = {(1,2), (1,3), (1,4), (2,3), (2,4), (3,4)}, A = (1, 3, 2, 4). Then X = 5, Y = 1, giving (5 − 1)/(5 + 1) = 2/3. 63
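A small sketch of this measure, reproducing the example above (X = 5, Y = 1):

```python
# Illustrative sketch: Kendall tau measure (X - Y) / (X + Y) between a ranking A
# and a set of pairwise preferences P.
from itertools import combinations

def kendall_tau(ranking, preferences):
    pos = {doc: i for i, doc in enumerate(ranking)}
    x = sum(1 for (a, b) in preferences if pos[a] < pos[b])   # agreements
    y = len(preferences) - x                                  # disagreements
    return (x - y) / (x + y)

# P = all pairs implied by the ideal order (1, 2, 3, 4), as in the slide example.
P = list(combinations([1, 2, 3, 4], 2))   # (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
print(kendall_tau([1, 3, 2, 4], P))       # X=5, Y=1 -> (5-1)/(5+1) = 0.667
```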

Resources for this lecture IIR Chapter 8; MIR Chapter 3; MG 4.5; Carbonell and Goldstein (1998). The use of MMR, diversity-based reranking for reordering documents and producing summaries. SIGIR '98. 95