Retrieval System Evaluation
W. Frisch, Institute of Government, European Studies and Comparative Social Science, University of Vienna

Assignment 1
- How did you select the search engines? How did you find them?
- How did you evaluate the systems? How did you compare them?
- Did you test the systems functionally? For performance? Systematically?

Assignment 2
- Get your account ready.
- Understand what you need to do.
- Use some cut-and-paste for the answers.

Evaluate an IR System
Functional evaluation
- Functional analysis: Does the system provide most of the functions that the user expects? What are the unique functions of this system? How user-friendly is the system?
- Error analysis: How often does the system fail? How easily can the user make errors?

Performance Evaluation
- Given a query, how well will the system perform? How do we define retrieval performance?
- Is finding all the related information our goal? Is it even possible to know that the system has found all the information?
- Given the user's information needs, how well will the system perform? Is the information found useful? -- Relevance

Relevance
Dictionary definition:
1. Pertinence to the matter at hand.
2. Applicability to social issues.
3. Computer Science. The capability of an information retrieval system to select and retrieve data appropriate to a user's needs.

Relevance for IR
- A measurement of the outcome of a search: the judgment on what should or should not be retrieved.
- There are no simple answers to what is relevant and what is not. Relevance is difficult to define and subjective, depending on knowledge, needs, time, situation, etc.
- The central concept of information retrieval.

Relevance to What?
- Information needs? Problems? Requests? Queries?
- The final test of relevance is whether users find the information useful: whether they can use it to solve the problems they have and to fill the information gap they perceive.

Relevance Judgment
- The user's judgment: How well do the retrieved documents satisfy the user's information needs? How useful are the retrieved documents? If a document is related but not useful, it is still not relevant.
- The system's judgment: How well does the retrieved document match the query? How likely would the user judge this information as useful?

Factors for Relevance Judgment
- Subject: judged by subject relatedness
- Novelty: how much new information is in the retrieved document; uniqueness/timeliness
- Quality/accuracy/truth
- Availability: source or pointer? accessibility? cost?
- Language: English or non-English; readability

Relevance Measurement
- Binary: relevant or not relevant
- Likert scale: not relevant, somewhat relevant, relevant, highly relevant

Precision and Recall
Given a query, how many documents should a system retrieve?
- Are all the retrieved documents relevant?
- Have all the relevant documents been retrieved?
Measures for system performance:
- The first question is about the precision of the search.
- The second is about the completeness (recall) of the search.

The contingency table:

                  Relevant   Not relevant
  Retrieved          a            b
  Not retrieved      c            d

  Precision P = a / (a + b) = (number of relevant documents retrieved) / (total number of documents retrieved)
  Recall    R = a / (a + c) = (number of relevant documents retrieved) / (number of all relevant documents in the database)

- Precision measures how precise a search is: the higher the precision, the fewer unwanted documents.
- Recall measures how complete a search is: the higher the recall, the fewer missing documents.

Relationship of R and P
- Theoretically, R and P do not depend on each other. Practically, high recall is achieved at the expense of precision, and high precision is achieved at the expense of recall.
- When will P = 0? Only when none of the retrieved documents is relevant.
- When will P = 1? Only when every retrieved document is relevant.
- What does P = 0.75 mean? What does R = 0.25 mean?
- What is your goal (in terms of P and R) when conducting a search? It depends on the purpose of the search, the information needs, and the system.
- What values of P and R would indicate a good system or a good search? There is no fixed value.
- Why does increasing recall often mean decreasing precision? In order not to miss anything and to cover all possible sources, one would have to scan many more materials, many of which might not be relevant.
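
As a small worked example, the sketch below computes precision and recall from two sets of document IDs; the IDs and judgments are made up for illustration.

def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of document IDs."""
    a = len(retrieved & relevant)    # relevant and retrieved
    b = len(retrieved - relevant)    # retrieved but not relevant
    c = len(relevant - retrieved)    # relevant but not retrieved
    precision = a / (a + b) if retrieved else 0.0
    recall = a / (a + c) if relevant else 0.0
    return precision, recall

retrieved = {"d1", "d2", "d3", "d4"}   # documents the system returned
relevant = {"d1", "d3", "d5", "d6"}    # documents the user judged relevant
p, r = precision_recall(retrieved, relevant)
print(f"P = {p:.2f}, R = {r:.2f}")     # P = 0.50, R = 0.50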

Ideal Retrieval Systems
- An ideal IR system would have P = 1 and R = 1 for all queries. Is that possible? Why?
- If information needs could be defined very precisely, and if relevance judgments could be made unambiguously, and if query matching could be designed perfectly, then we would have an ideal system. But then it would no longer be an information retrieval system.

Alternative Measures
Combining recall and precision:

  F = 2 / (1/R + 1/P)
  E = (1 + k^2) / (k^2/R + 1/P)

User-Oriented Measures

Measure: Coverage
- Coverage: the fraction of the documents known to the user to be relevant that has actually been retrieved.

  Coverage = (relevant documents retrieved and known to the user) / (relevant documents known to the user)

- If coverage = 1, everything the user knows has been retrieved.

Measure: Novelty
- Novelty: the fraction of the relevant documents retrieved that was unknown to the user.

  Novelty = (relevant documents retrieved but unknown to the user) / (relevant documents retrieved)

Evaluation of IR Systems Using Recall and Precision
- Conduct query searches: try many different queries; results may depend on the sampled queries.
- Compare the results in terms of precision and recall: recall and precision need to be considered together.
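
A minimal sketch of these measures, assuming relevance judgments and the user's previously known documents are available as sets of document IDs; the value of k and the example sets are illustrative.

def f_measure(p, r):
    """F = 2 / (1/R + 1/P), the harmonic mean of precision and recall."""
    return 2 / (1 / r + 1 / p) if p > 0 and r > 0 else 0.0

def e_measure(p, r, k=1.0):
    """E = (1 + k^2) / (k^2/R + 1/P); k weights recall against precision."""
    return (1 + k ** 2) / (k ** 2 / r + 1 / p)

def coverage_and_novelty(retrieved_relevant, known_relevant):
    """Coverage and novelty from sets of relevant document IDs."""
    coverage = len(retrieved_relevant & known_relevant) / len(known_relevant)
    novelty = len(retrieved_relevant - known_relevant) / len(retrieved_relevant)
    return coverage, novelty

print(round(f_measure(0.79, 0.20), 2))    # 0.32
print(round(e_measure(0.79, 0.20), 2))    # equals F when k = 1
cov, nov = coverage_and_novelty({"d1", "d3", "d5"}, {"d1", "d2", "d3"})
print(round(cov, 2), round(nov, 2))       # 0.67 0.33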

Use Precision and Recall to Evaluate IR Systems

[P-R diagram: precision (vertical axis) against recall (horizontal axis) for Systems A, B, and C]

Precision/recall pairs (P / R) per query:

  System    Query 1     Query 2     Query 3      Query 4     Query 5
  A         0.9 / 0.1   0.7 / 0.4   0.45 / 0.5   0.3 / 0.6   0.1 / 0.8
  B         0.8 / 0.2   0.5 / 0.3   0.4 / 0.5    0.3 / 0.7   0.2 / 0.8
  C         0.9 / 0.4   0.7 / 0.6   0.5 / 0.7    0.3 / 0.8   0.2 / 0.9

Use fixed interval levels of recall to compare precision (System A, precision at fixed recall levels, averaged across queries):

            Query 1   Query 2   Query 3   Average precision
  R = .25     0.6       0.7       0.9          0.73
  R = .50     0.5       0.4       0.7          0.53
  R = .75     0.2       0.3       0.4          0.30

Use fixed intervals of the number of retrieved documents to compare precision (number of relevant documents retrieved at each cutoff, and the resulting precision):

  Documents retrieved   System 1   System 2   System 3   Precision
  N = 10                    4          5          6        0.50
  N = 20                    4          5         16        0.41
  N = 30                    5          5         17        0.30
  N = 40                    8          6         24        0.31
  N = 50                   10          6         25        0.27

Problems Using P/R for Evaluation
- For a real-world system, recall is always an estimate, and results depend on the sampled queries.
- Recall and precision do not capture the interactive aspect of the retrieval process.
- Recall and precision are only one aspect of system performance: high recall and high precision are desirable, but not necessarily the most important thing the user considers.
- R and P are based on the assumption that the set of relevant documents for a query is the same, independent of the user.

Quality Evaluation
- Data quality
  - Coverage of the database: a document will not be found if it is not in the database.
  - Completeness and accuracy of the data.
- Indexing methods and indexing quality: a document will not be found if it is not indexed.
  - Indexing types
  - Currency of indexing (is it updated often?)
  - Indexing sizes
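
The two comparison methods above can be sketched as follows. The interpolation rule (taking the best precision at any rank whose recall has reached the level) is one common choice, not necessarily the exact method used in the slides, and the ranking and judgments are made up.

def precision_at_n(ranking, relevant, n):
    """Fraction of the top-n results that are relevant."""
    return sum(1 for d in ranking[:n] if d in relevant) / n

def precision_at_recall(ranking, relevant, level):
    """Best precision at any rank whose recall has reached the given level."""
    best, hits = 0.0, 0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        if hits / len(relevant) >= level:
            best = max(best, hits / i)
    return best

ranking = ["d3", "d9", "d1", "d4", "d7", "d2"]   # system output, best first
relevant = {"d1", "d3", "d7"}                    # user judgments

print(precision_at_n(ranking, relevant, 5))      # 0.6
print([precision_at_recall(ranking, relevant, r) for r in (0.25, 0.50, 0.75)])
# -> [1.0, 0.666..., 0.6] for this toy ranking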

Web coverage example: about 320 million pages in total; examples of quality problems include invalid links.

Interface Considerations
- A user-friendly interface: How long does it take a user to learn the advanced features?
- How well can the user explore or interact with the query output?
- How easy is it to customize the output displays?

User Satisfaction
- The final test is the user! User satisfaction is more important than precision and recall.
- Measuring user satisfaction: surveys, usage statistics, user experiments.

User Experiments
- Observe and collect data on system behavior, user search behavior, and user-system interaction.
- Interpret experiment results: for system comparisons, for understanding users' information-seeking behavior, and for developing new retrieval systems and interfaces.

A Landmark Study
- "An evaluation of retrieval effectiveness for a full-text document-retrieval system", 1985, by David Blair and M. E. Maron.
- The first large-scale evaluation of full-text retrieval.
- Significant and controversial results; good experimental design.

The Setting
- An IBM full-text retrieval system with 40,000 documents (about 350,000 pages), to be used in the defense of a large corporate lawsuit.
- Large by 1985 standards; a typical size today.
- Mostly Boolean search functions, with some ranking functions added; full-text automatic indexing.

The Experiment
- Two lawyers generated 51 requests.
- Two paralegals conducted searches again and again until the lawyers were satisfied with the results, i.e., until the lawyers believed that more than 75% of the relevant documents had been found.
- The paralegals and lawyers could have as many discussions as needed.

The Results
- Average precision = 0.79
- Average recall = 0.20

[Figure: precision-recall plot marking the result at roughly P = 0.79, R = 0.20]

Precision Calculation
- The lawyers judged documents as vital, satisfactory, marginally relevant, or irrelevant.
- The first three categories were all counted as relevant in the precision calculation.

Recall Calculation
- Samples were drawn from subsets of the database believed to be rich in relevant documents.
- The samples were mixed with the retrieved sets and sent to the lawyers for relevance judgments.

The Most Significant Result
- The recall is low. Even though the recall was only 20%, the lawyers were satisfied (and believed that 75% of the relevant documents had been retrieved).
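
A simplified sketch of recall estimation by sampling: here the sample is drawn uniformly from the unretrieved documents, whereas the study sampled from subsets believed to be rich in relevant documents; all names and numbers are illustrative.

import random

def estimate_recall(retrieved_relevant, unretrieved_pool, judge, sample_size=200):
    """Estimate recall = retrieved relevant / (retrieved relevant + estimated missed)."""
    sample = random.sample(unretrieved_pool, min(sample_size, len(unretrieved_pool)))
    relevant_rate = sum(judge(d) for d in sample) / len(sample)
    estimated_missed = relevant_rate * len(unretrieved_pool)
    return retrieved_relevant / (retrieved_relevant + estimated_missed)

# Toy usage: 400 relevant documents retrieved, and a pool of 30,000 unretrieved
# documents in which about 5% of sampled items are judged relevant.
pool = list(range(30_000))
judge = lambda d: random.random() < 0.05     # stand-in for a human judgment
print(round(estimate_recall(400, pool, judge), 2))   # roughly 0.21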

Questions
- Why was the recall so low?
- Do we really need high recall?
- If the study were run today on search engines like Google, would the results be the same or different?

Discussion: Levels of Evaluation
- On the engineering level
- On the input level
- On the processing level
- On the output level
- On the use and user level
- On the social level
--- Tefko Saracevic, SIGIR '95