Chapter III.2: Basic ranking & evaluation measures

1. TF-IDF and vector space model
   1.1. Term frequency counting with TF-IDF
   1.2. Documents and queries as vectors
2. Evaluating IR results
   2.1. Evaluation methods
   2.2. Unranked evaluation measures

Based on Manning/Raghavan/Schütze, Chapters 6 and 8

TF-IDF and the vector space model

In the Boolean case we considered each document as a set of words: a query matches all documents in which its terms appear at least once. But if a term appears more than once, is the document not a better match? Instead of Boolean queries we use free-text queries (e.g. normal Google queries): the query is seen as a set of words, and a document is a better match if it contains the query words more often.

TF and IDF

The term frequency of term t in document d, tf_{t,d}, is simply the number of times t appears in d. Naïve scoring: the score of document d for query q is the sum of tf_{t,d} over the terms t in q. But some terms appear more often overall than others. The document frequency of term t, df_t, is the number of documents in which t appears. The inverse document frequency of term t is

idf_t = log(N / df_t),

where N is the total number of documents.

TF-IDF

The TF-IDF of term t in document d is

tf-idf_{t,d} = tf_{t,d} · idf_t.

tf-idf_{t,d} is high when t occurs often in d but rarely in other documents; it is smaller if t occurs fewer times in d or occurs more often in other documents. Slightly less naïve scoring:

Score(q, d) = Σ_{t ∈ q} tf-idf_{t,d}.
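As a concrete illustration of this scoring, here is a minimal Python sketch; the toy collection, the tokenization into word lists, and the function name tfidf_scores are my own and not part of the lecture material:

```python
import math
from collections import Counter

def tfidf_scores(docs, query):
    """Score each document for a free-text query by summing tf-idf_{t,d}
    over the query terms. `docs` and `query` are pre-tokenized term lists."""
    N = len(docs)
    tf = [Counter(doc) for doc in docs]                      # tf_{t,d}
    df = Counter(t for counts in tf for t in counts)         # df_t
    idf = {t: math.log(N / df[t]) for t in df}               # idf_t = log(N / df_t)
    return [sum(tf[d][t] * idf.get(t, 0.0) for t in query)   # Score(q, d)
            for d in range(N)]

docs = [["caesar", "brutus", "caesar"], ["brutus", "calpurnia"], ["mercy", "worser"]]
print(tfidf_scores(docs, ["caesar", "brutus"]))
```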

Variations of TF-IDF

Sublinear tf scaling addresses the problem that 20 occurrences of a word are probably not 20 times as important as a single occurrence:

wf_{t,d} = 1 + log(tf_{t,d})  if tf_{t,d} > 0,  and 0 otherwise.

Maximum tf normalization tries to overcome the problem that longer documents yield higher tf-idf scores:

ntf_{t,d} = a + (1 − a) · tf_{t,d} / tf_max(d),

where tf_max(d) is the largest term frequency in document d and a is a smoothing parameter, 0 < a < 1 (typically a = 0.4).
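A small sketch of the two variants, assuming tf_max(d) is supplied explicitly; the function names are illustrative:

```python
import math

def wf(tf):
    """Sublinear tf scaling: wf_{t,d} = 1 + log(tf_{t,d}) if tf_{t,d} > 0, else 0."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

def ntf(tf, tf_max, a=0.4):
    """Maximum tf normalization: ntf_{t,d} = a + (1 - a) * tf_{t,d} / tf_max(d)."""
    return a + (1.0 - a) * tf / tf_max

print(wf(20), wf(1))        # 20 occurrences score ~4x one occurrence, not 20x
print(ntf(3, tf_max=10))    # 0.4 + 0.6 * 0.3 = 0.58
```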

Documents as vectors

Documents can be represented as M-dimensional vectors, where M is the number of terms in the vocabulary and the vector components are the tf-idf (or similar) weights. This representation does not store the order of the terms in a document. A collection of N documents can then be represented as an M-by-N matrix in which each document is a column vector. Queries can also be treated as vectors: a query is just a very short document.
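The M-by-N term-document matrix can be built directly from the tf-idf weights computed earlier; a rough sketch (the vocabulary ordering and names are my own choice):

```python
import math
from collections import Counter

def term_document_matrix(docs):
    """Build an M-by-N tf-idf matrix: one row per vocabulary term,
    one column per document (term order inside documents is ignored)."""
    N = len(docs)
    vocab = sorted({t for doc in docs for t in doc})          # the M vocabulary terms
    tf = [Counter(doc) for doc in docs]
    df = Counter(t for counts in tf for t in counts)
    idf = {t: math.log(N / df[t]) for t in vocab}
    matrix = [[tf[d][t] * idf[t] for d in range(N)]           # column d = document d
              for t in vocab]                                 # row t = term t
    return vocab, matrix
```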

The vector space model

The similarity between two document vectors can be computed using cosine similarity:

sim(d1, d2) = ⟨v(d1), v(d2)⟩,

where v(d) is the normalized vector representation of document d; the normalized version of a vector V is V / ‖V‖. Cosine similarity can therefore equivalently be written in terms of the unnormalized vectors V(d):

sim(d1, d2) = ⟨V(d1), V(d2)⟩ / (‖V(d1)‖ ‖V(d2)‖).

Cosine similarity is the cosine of the angle between v(d1) and v(d2).
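A direct translation of the unnormalized form, for dense vectors such as the matrix columns above (a plain-Python sketch, no particular library assumed):

```python
import math

def cosine_similarity(u, v):
    """sim(d1, d2) = <V(d1), V(d2)> / (||V(d1)|| * ||V(d2)||) for dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 1.0, 1.0]))  # ≈ 0.73
```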

Finding the best documents

Using cosine similarity and the vector space model, we can obtain the best document d* with respect to query q from document collection D as

d* = arg max_{d ∈ D} ⟨v(q), v(d)⟩.

This is easily extended to the top-k documents.
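Top-k retrieval is then a matter of scoring every document vector against the query vector and keeping the k best; a sketch reusing cosine_similarity from the previous snippet:

```python
import heapq

def top_k(query_vec, doc_vecs, k=10):
    """Indices of the k document vectors with the highest cosine similarity
    to the query vector (arg max over d in D of <v(q), v(d)>)."""
    scored = ((cosine_similarity(query_vec, d), i) for i, d in enumerate(doc_vecs))
    return [i for _, i in heapq.nlargest(k, scored)]
```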

Evaluating IR results

We need a way to evaluate different IR systems: What pre-processing should I do? Should I use tf-idf, wf-idf, or ntf-idf? Is cosine similarity a good similarity measure, or should I use something else? For this we need evaluation data and evaluation metrics.

IR evaluation data

Evaluation data consists of document collections in which documents are labeled relevant or irrelevant for different information needs. An information need is not a query; it is turned into one. E.g. "Which plays of Shakespeare have the characters Caesar and Brutus, but not Calpurnia?" For tuning parameters, collections are divided into development (or training) and test sets. Several real-world test collections are commonly used to evaluate IR methods.

Classifying the results

The retrieved documents can be either relevant or irrelevant, and the same holds for the documents that are not retrieved. We would like to retrieve the relevant documents and not retrieve the irrelevant ones. This classifies all documents into four classes:

                 relevant               irrelevant
  retrieved      true positives (tp)    false positives (fp)
  not retrieved  false negatives (fn)   true negatives (tn)

Unranked evaluation measures

Precision, P, is the fraction of retrieved documents that are relevant:

P = tp / (tp + fp).

Recall, R, is the fraction of relevant documents that are retrieved:

R = tp / (tp + fn).

Accuracy, acc, is the fraction of correctly classified documents:

acc = (tp + tn) / (tp + fp + tn + fn).

Accuracy is not an appropriate measure for IR problems.
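The three measures in code, with a toy example of why accuracy is misleading in IR, where almost all documents are irrelevant and never retrieved (the counts are invented for illustration):

```python
def precision(tp, fp):
    """P = tp / (tp + fp): fraction of retrieved documents that are relevant."""
    return tp / (tp + fp)

def recall(tp, fn):
    """R = tp / (tp + fn): fraction of relevant documents that are retrieved."""
    return tp / (tp + fn)

def accuracy(tp, fp, fn, tn):
    """acc = (tp + tn) / (tp + fp + tn + fn): fraction correctly classified."""
    return (tp + tn) / (tp + fp + fn + tn)

# Retrieving 1 of 1000 relevant documents in a millions-of-documents collection
# still yields near-perfect accuracy, while recall is terrible:
print(accuracy(tp=1, fp=0, fn=999, tn=999000))  # ≈ 0.999
print(recall(tp=1, fn=999))                      # 0.001
```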

The F measure

Different tasks may emphasize precision or recall (web search, library search, ...), but usually some balance is sought: maximizing either one alone is easy if the other can be arbitrarily low. The F measure is a trade-off between precision and recall:

F_β = ((β² + 1) · P · R) / (β² · P + R).

β is a trade-off parameter: β = 1 gives the balanced F measure, β < 1 emphasizes precision, and β > 1 emphasizes recall.
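A sketch of the F measure; the helper name f_measure and the example values are my own:

```python
def f_measure(p, r, beta=1.0):
    """F_beta = ((beta^2 + 1) * P * R) / (beta^2 * P + R); beta = 1 is balanced F."""
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r) if (p or r) else 0.0

print(f_measure(0.8, 0.3))             # balanced F1 ≈ 0.44
print(f_measure(0.8, 0.3, beta=0.5))   # emphasizes precision: 0.6
```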

F measure example: F_β plotted as a function of β (0 ≤ β ≤ 2.25, F on the vertical axis from 0 to 1) for three precision/recall pairs: P = 0.8, R = 0.3; P = 0.6, R = 0.6; P = 0.2, R = 0.9.

Ranked evaluation measures

Precision as a function of recall, P(r): What is the precision after we have obtained 10% of the relevant documents in the ranked results? The interpolated precision at recall level r is max_{r' ≥ r} P(r'), which is used to draw precision-recall curves. Precision at k (P@k) is the precision after the top-k documents (relevant or not) have been retrieved; typically k = 5, 10, 20, e.g. in web search. Similarly,

F_β@k = ((β² + 1) · P@k · R@k) / (β² · P@k + R@k).
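P@k and interpolated precision are easy to compute from a ranked list of binary relevance judgments; a sketch (the function names and the example ranking are illustrative):

```python
def precision_at_k(relevance, k):
    """P@k: fraction of the top-k retrieved documents that are relevant.
    `relevance` is the ranked list of 0/1 relevance judgments."""
    return sum(relevance[:k]) / k

def interpolated_precision(recall_levels, precisions, r):
    """Interpolated precision at recall level r: max precision at any recall >= r."""
    candidates = [p for rec, p in zip(recall_levels, precisions) if rec >= r]
    return max(candidates) if candidates else 0.0

ranked = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]   # relevance of the ranked results
print(precision_at_k(ranked, 5))           # 0.6
```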

Mean Average Precision

Precision, recall, and the F measure are unordered measures. Mean average precision (MAP) averages over different information needs and their ranked results. Let {d_1, ..., d_{m_j}} be the set of relevant documents for q_j ∈ Q, and let R_{jk} be the set of ranked retrieval results for q_j from the top result down to document d_k. Then

MAP(Q) = (1 / |Q|) Σ_{j=1}^{|Q|} (1 / m_j) Σ_{k=1}^{m_j} Precision(R_{jk}).

The inner term, (1 / m_j) Σ_{k=1}^{m_j} Precision(R_{jk}), is the average precision for query q_j.
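A sketch of average precision and MAP, under the usual convention that relevant documents never retrieved contribute precision 0; the input format and names are my own:

```python
def average_precision(ranked_relevance, num_relevant):
    """AP for one query: mean of Precision(R_jk), taken at the rank of each
    retrieved relevant document d_k. `ranked_relevance` holds 0/1 judgments of
    the ranked results; `num_relevant` is m_j, the total number of relevant docs."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / num_relevant if num_relevant else 0.0

def mean_average_precision(runs):
    """MAP(Q): mean of the per-query average precisions.
    `runs` is a list of (ranked_relevance, num_relevant) pairs, one per query."""
    return sum(average_precision(r, m) for r, m in runs) / len(runs)

print(mean_average_precision([([1, 0, 1, 0], 2), ([0, 1, 0, 0], 1)]))  # ≈ 0.67
```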

Measures for weighted relevance

Relevance does not have to be binary: 0 = not relevant, 1 = slightly relevant, 2 = more relevant, ... Let R(q, m) ∈ {0, 1, 2, ...} be the relevance of the document at rank m for query q. The discounted cumulative gain (DCG) for information need q is

DCG(q, k) = Σ_{m=1}^{k} (2^{R(q,m)} − 1) / log(1 + m).

The normalized discounted cumulative gain (NDCG) is

NDCG(q, k) = DCG(q, k) / IDCG(q, k).

Ideal discounted cumulative gain (IDCG)

Let the relevance levels be {0, 1, 2, 3}. Order the relevance judgments for q in descending order, I_q = (3, 3, ..., 3, 2, 2, ..., 2, 1, 1, ..., 1, 0, 0, ..., 0), and compute

IDCG(q, k) = Σ_{m=1}^{k} (2^{I_q(m)} − 1) / log(1 + m).

IDCG(q, k) is the maximum value that DCG(q, k) can attain; therefore NDCG(q, k) ≤ 1.
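A sketch of DCG and NDCG, assuming for simplicity that the ideal ranking is obtained by sorting the same list of graded judgments in descending order; the example grades are invented:

```python
import math

def dcg(relevances, k):
    """DCG(q, k) = sum_{m=1..k} (2^R(q,m) - 1) / log(1 + m), for graded
    relevances listed in ranked order."""
    return sum((2 ** r - 1) / math.log(1 + m)
               for m, r in enumerate(relevances[:k], start=1))

def ndcg(relevances, k):
    """NDCG(q, k) = DCG(q, k) / IDCG(q, k), where the ideal ranking sorts the
    same judgments in descending order."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 3, 0, 1, 2], k=5))  # ≈ 0.88 for this ranking
```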