Evaluation of Retrieval Systems

Performance Criteria
1. Expressiveness of the query language: can the query language capture information needs?
2. Quality of search results: relevance to users' information needs
3. Usability: search interface, results page format, other?
4. Efficiency: speed affects usability; overall efficiency affects cost of operation
5. Other?

Quantitative Evaluation
Concentrate on the quality of search results. Goals for a measure:
- Capture relevance to the user's information need
- Allow comparison between the results of different systems
Measures are defined for sets of documents returned; more generally, a "document" could be any information object.

Core Measures: Precision and Recall
These require a binary judgment by a human of each retrieved document as relevant or irrelevant, and knowledge of the complete set of relevant documents within the collection being searched.

Recall = (# relevant documents retrieved) / (# relevant documents)
Precision = (# relevant documents retrieved) / (# retrieved documents)
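To make the set computations concrete, here is a minimal Python sketch of the two measures; it is not from the slides, and the document IDs in the example are hypothetical.

```python
# A minimal sketch: set-based precision and recall for one query.
def precision_recall(retrieved, relevant):
    """Return (precision, recall) given sets of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                    # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 3 of 4 retrieved docs are relevant; 5 relevant docs exist.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d3", "d7", "d9"})
print(p, r)   # 0.75 0.6
```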

Combining Recall and Precision
The F-score (aka F-measure) is defined to be the harmonic mean of precision and recall:

F = (2 * precision * recall) / (precision + recall)

The harmonic mean h of two numbers m and n satisfies (n - h)/n = (h - m)/m, i.e. (1/m) - (1/h) = (1/h) - (1/n). (A small sketch follows below.)

Use in Modern Times
These measures were defined in the 1950s. For small collections they make sense. For large collections we rarely know the complete set of relevant documents and rarely could return the complete set of relevant documents. So for large collections: rank the returned documents and use the ranking!
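As a small illustration (not from the slides), the F-score can be computed directly from the two values; the inputs below are the hypothetical precision and recall from the earlier example.

```python
# A minimal sketch: F-score as the harmonic mean of precision and recall.
def f_score(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The harmonic mean lies between the two values but closer to the smaller one.
print(f_score(0.75, 0.6))   # 0.666...
```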

Ranked Result List
At any point along the ranked list we can look at the precision so far, and at the recall so far if the total number of relevant documents is known (Google's "about N results" is an inadequate estimate). We can focus on the points at which relevant documents appear: if the m-th document in the ranking is the k-th relevant document seen so far, the precision at that point is k/m. There is no a priori ranking of the relevant documents themselves.

Plot: Precision versus Recall
Choose standard recall levels r_1, r_2, ..., e.g. 10%, 20%, ... Define the precision at recall level r_j as
p(r_j) = max, over all recall values r with r_j <= r < r_{j+1}, of the precision when recall r is achieved.
This is similar to the interpolated precision of the Intro IR text.
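A minimal sketch of these two steps, assuming a hypothetical ranking and judgment set; it records a (recall, precision) point at each relevant document and then applies the slide's definition of p(r_j).

```python
# A minimal sketch: (recall, precision) points along a ranking, and
# precision at standard recall levels as defined on the slide.
def precision_recall_points(ranking, relevant):
    """(recall, precision) at each rank where a relevant document appears."""
    relevant = set(relevant)
    points, hits = [], 0
    for m, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / m))   # k-th relevant doc at rank m
    return points

def precision_at_recall_levels(points, levels):
    """p(r_j) = max precision over recalls r with r_j <= r < r_{j+1}."""
    out = []
    for j, rj in enumerate(levels):
        upper = levels[j + 1] if j + 1 < len(levels) else 1.01
        precs = [p for r, p in points if rj <= r < upper]
        out.append(max(precs) if precs else 0.0)
    return out

points = precision_recall_points(["d3", "d9", "d1", "d8", "d2"], {"d1", "d2", "d3"})
print(precision_at_recall_levels(points, [i / 10 for i in range(11)]))
```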

[Figure reproduced from the presentation "Overview of TREC 2004" by Ellen Voorhees, available from the TREC presentations Web site: trec.nist.gov/presentations/trec2004/04overview.pdf]

Single-Number Characterizations
We can look at precision at one fixed critical position: precision at k. If we know there are R relevant documents, we can choose k = R, though we may not want to look that far down the list even when R is known. We can also choose a set of S relevant documents and calculate precision at k = S only with respect to those documents (the R-precision of Intro IR). For Web search, choose k to be the number of pages people actually look at. k = ? What are you expecting?
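A minimal sketch of precision at k and R-precision, again with a hypothetical ranking and judgment set (not from the slides).

```python
# A minimal sketch: precision at a fixed cutoff k, and R-precision
# (precision at k = R, the number of known relevant documents).
def precision_at_k(ranking, relevant, k):
    relevant = set(relevant)
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def r_precision(ranking, relevant):
    return precision_at_k(ranking, relevant, len(relevant))

ranking = ["d3", "d9", "d1", "d8", "d2"]
relevant = {"d1", "d2", "d3"}
print(precision_at_k(ranking, relevant, 5))   # 3/5 = 0.6
print(r_precision(ranking, relevant))         # precision at k = 3 -> 2/3
```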

Single-Number Characterizations, continued
1) Record the precision at each point a relevant document is encountered going down the ranked list. We don't need to know all relevant documents, and the ranked list can be cut off at a predetermined rank.
2) Average the precisions recorded in (1); this is the average precision for one query result.
Mean Average Precision (MAP): for a set of test queries, take the mean (i.e. average) of the average precision of each query. Retrieval systems are compared by their MAP (see the sketch below).

Using Measures
Statistical significance versus meaningfulness. Use more than one measure. We need some set of relevant documents even if we don't have the complete set. How? Look at the TREC studies.
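A minimal sketch of average precision and MAP as described above; the test queries, rankings, and judgments are hypothetical.

```python
# A minimal sketch: average precision for one query, and MAP over queries.
def average_precision(ranking, relevant):
    relevant = set(relevant)
    hits, precisions = 0, []
    for m, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / m)    # precision recorded at each relevant doc
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(runs):
    """runs: list of (ranking, relevant) pairs, one per test query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

runs = [
    (["d3", "d9", "d1"], {"d1", "d3"}),    # AP = (1/1 + 2/3) / 2
    (["d7", "d2", "d5"], {"d2"}),          # AP = 1/2
]
print(mean_average_precision(runs))        # about 0.67
```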

Relevance by the TREC Method
The Text REtrieval Conference (TREC) has run from 1992 to the present. There is a fixed collection per track, e.g. *.gov or CACM articles. Each competing search engine for a track is asked to retrieve documents on several topics; the search engine turns each topic into a query. The topic description has a clear statement of what is to be considered relevant by the human judge.

Sample TREC 3 topic:
<num> Number: 168
<title> Topic: Financing AMTRAK
<desc> Description: A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK).
<narr> Narrative: A relevant document must provide information on the government's responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.
</top>
As appeared in "Overview of the Sixth Text REtrieval Conference (TREC-6)", E. M. Voorhees and D. Harman, in NIST Special Publication 500-240: The Sixth Text REtrieval Conference, 1997.

Sample TREC 7 topic:
<num> Number: 396
<title> sick building syndrome
<desc> Description: Identify documents that discuss sick building syndrome or building-related illnesses.
<narr> Narrative: A relevant document would contain any data that refers to the sick building or building-related illnesses, including illnesses caused by asbestos, air conditioning, pollution controls. Work-related illnesses not caused by the building, such as carpal tunnel syndrome, are not relevant.
From "Overview of the Seventh Text REtrieval Conference (TREC-7)", E. M. Voorhees and D. Harman, in NIST Special Publication 500-242: The Seventh Text REtrieval Conference, 1998.

Pooling
Human judges cannot look at all documents in the collection: there are thousands to millions of them. Pooling chooses a subset of the collection's documents for the human judges to rate for relevance; documents not in the pool are assumed to be not relevant.

How to Construct the Pool for a Topic?
Let the competing search engines decide:
- Choose a parameter k (typically 100)
- Take the top k documents as ranked by each search engine
- Pool = union of these sets of documents, so the pool contains between k and (# search engines) * k documents
- Give the pool to the judges for relevance scoring
(A sketch of this construction follows below.)

Pooling, continued
The (k+1)st document returned by one search engine is either irrelevant or ranked higher by another search engine in the competition. In the competition, each search engine is judged on its results for the top r > k documents returned.
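A minimal sketch of the pool construction described above; the system names and rankings are hypothetical.

```python
# A minimal sketch: build the judging pool for one topic as the union of
# the top-k documents from each competing system's ranking.
def build_pool(rankings, k=100):
    """rankings: dict mapping system name -> ranked list of document IDs."""
    pool = set()
    for ranked_docs in rankings.values():
        pool.update(ranked_docs[:k])          # top k from this system
    return pool                               # between k and len(rankings)*k documents

runs = {
    "systemA": ["d1", "d2", "d3", "d4"],
    "systemB": ["d3", "d5", "d1", "d6"],
}
print(sorted(build_pool(runs, k=3)))          # ['d1', 'd2', 'd3', 'd5']
```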

[Figure reproduced from the presentation "Overview of TREC 2004" by Ellen Voorhees, available from the TREC presentations Web site: trec.nist.gov/presentations/trec2004/04overview.pdf]

Web Search Evaluation
Different kinds of queries are identified in the TREC Web Track; some are:
- Ad hoc
- Topic distillation: the set of key resources is small; 100% recall?
- Home page: # relevant pages = 1 (except mirrors)
Should these be distinguished for the competitors, or just for the judges? Andrei Broder gave similar categories:
- Informational: broad research or a single fact?
- Transactional
- Navigational

More Web/Online Issues
There are browser-dependent and presentation-dependent issues: is a result on the first page of results? Can it be seen without scrolling?

Other Issues in Evaluation
Degrees of relevance? Discounted Cumulative Gain (DCG) tries to measure this, using the degree of relevance and the position in the ranking (see the sketch below). Does retrieving highly relevant documents really satisfy users? Subjectivity? Are there dependencies not accounted for? Many searches are interactive.
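A minimal sketch of one common DCG formulation, with a normalized variant (nDCG); the relevance grades are hypothetical (0 = not relevant, 2 = highly relevant), and other discounting choices exist.

```python
# A minimal sketch: Discounted Cumulative Gain with a log2 rank discount.
import math

def dcg(grades):
    """grades: graded relevance judgments in ranked order (rank 1 first)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(grades, start=1))

grades = [2, 0, 1, 2]                  # hypothetical graded judgments down the ranking
ideal = sorted(grades, reverse=True)   # best possible ordering of the same grades
print(dcg(grades))                     # DCG of this ranking
print(dcg(grades) / dcg(ideal))        # normalized DCG (nDCG) in [0, 1]
```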