Information Retrieval: Techniques for Relevance Feedback

Introduction

An information need may be expressed using different keywords (synonymy), with an impact on recall. Examples: ship vs. boat, aircraft vs. airplane.

Solutions: refining queries manually, or expanding queries (semi-)automatically.

Semi-automatic query expansion:
- local methods: based on the retrieved documents and the query (e.g. relevance feedback)
- global methods: independent of the query and results (e.g. thesaurus, spelling corrections)

About Relevance Feedback

Feedback given by the user about the relevance of the documents in the initial set of results.

About Relevance Feedback (continued)

Based on the idea that (i) defining good queries is difficult when the collection is (partly) unknown, while (ii) judging particular documents is easy. Makes it possible to deal with situations where the user's information need evolves as the retrieved documents are checked.

Example: image search engine, http://nayana.ece.ucsb.edu/imsearch/imsearch.html

Relevance feedback example 1 (screenshots of the image search engine)

Example 2: images marked relevant (V) and non-relevant (X) by the user (image credit: 國立歷史博物館 / 師大 / 新視提供)

The Rocchio algorithm

Standard algorithm for relevance feedback (SMART, 1970s). Integrates a measure of relevance feedback into the vector space model.

Idea: we want to find a query vector $\vec{q}_{opt}$ maximizing the similarity with relevant documents while minimizing the similarity with non-relevant documents:

$$\vec{q}_{opt} = \arg\max_{\vec{q}} \left[ \mathrm{sim}(\vec{q}, C_r) - \mathrm{sim}(\vec{q}, C_{nr}) \right]$$

With the cosine similarity, this gives:

$$\vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j \; - \; \frac{1}{|C_{nr}|} \sum_{\vec{d}_j \in C_{nr}} \vec{d}_j$$

The Rocchio algorithm (continued)

Problem with the above formula: the full set of relevant documents is unknown. Instead, we produce the modified query $\vec{q}_m$:

$$\vec{q}_m = \alpha \vec{q}_0 + \beta \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j \; - \; \gamma \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j$$

where:
- $\vec{q}_0$ is the original query vector
- $D_r$ is the set of known relevant documents
- $D_{nr}$ is the set of known non-relevant documents
- $\alpha, \beta, \gamma$ are balancing weights (user judgments vs. system)
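The modified-query formula can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `rocchio` and the use of NumPy arrays for term-weight vectors are assumptions.

```python
import numpy as np

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio query modification:
    q_m = alpha*q0 + beta*centroid(D_r) - gamma*centroid(D_nr)."""
    qm = alpha * np.asarray(q0, dtype=float)
    if len(rel_docs):                 # centroid of known relevant documents
        qm = qm + beta * np.mean(np.asarray(rel_docs, dtype=float), axis=0)
    if len(nonrel_docs):              # centroid of known non-relevant documents
        qm = qm - gamma * np.mean(np.asarray(nonrel_docs, dtype=float), axis=0)
    return np.maximum(qm, 0.0)        # negative weights are usually ignored

# toy two-term vocabulary: two docs judged relevant, one non-relevant
revised = rocchio([1.0, 0.0],
                  rel_docs=[[0.0, 1.0], [1.0, 1.0]],
                  nonrel_docs=[[1.0, 0.0]])
```

The default weights are the empirically determined values quoted on the next slide (α = 1, β = 0.75, γ = 0.15).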

Figure: relevance feedback on the initial query. The revised query vector moves toward the known relevant documents and away from the known non-relevant ones.

The Rocchio algorithm (continued)

Remarks:
- Negative weights are usually ignored.
- Rocchio-based relevance feedback improves both recall and precision.
- Reaching high recall requires many iterations.
- Empirically determined values for the balancing weights: α = 1, β = 0.75, γ = 0.15.
- Positive feedback is usually more valuable than negative feedback, hence β > γ.

Rocchio algorithm: exercise

Consider the following collection (one document per line):

good movie trailer shown
trailer with good actor
unseen movie

a dictionary made of the words movie, trailer and good, and an IR system using the standard tf-idf weighting (without normalisation). Assuming a user judges the first two documents relevant for the query "movie trailer", what would be the Rocchio-revised query?
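A minimal sketch of this exercise in Python, under assumptions the slide leaves open: base-10 logarithm for idf, raw term counts for tf, the query weighted like a document, and the balancing weights α = 1, β = 0.75, γ = 0.15 quoted earlier.

```python
import math

docs = [["good", "movie", "trailer", "shown"],
        ["trailer", "with", "good", "actor"],
        ["unseen", "movie"]]
vocab = ["movie", "trailer", "good"]
N = len(docs)

# document frequency and idf (base-10 log) over the 3-document collection
df = {t: sum(t in d for d in docs) for t in vocab}
idf = {t: math.log10(N / df[t]) for t in vocab}

def tfidf(words):
    # raw term frequency times idf, no length normalisation
    return [words.count(t) * idf[t] for t in vocab]

q0 = tfidf(["movie", "trailer"])
vecs = [tfidf(d) for d in docs]
rel, nonrel = vecs[:2], vecs[2:]        # first two documents judged relevant

alpha, beta, gamma = 1.0, 0.75, 0.15
qm = [alpha * q0[i]
      + beta * sum(v[i] for v in rel) / len(rel)
      - gamma * sum(v[i] for v in nonrel) / len(nonrel)
      for i in range(len(vocab))]
print(dict(zip(vocab, [round(w, 4) for w in qm])))
# → {'movie': 0.2157, 'trailer': 0.3082, 'good': 0.1321}
```

All three terms have df = 2, so idf = log10(3/2) ≈ 0.176 throughout; "trailer" ends up weighted highest because it appears in the query and in both relevant documents.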

Relevance Feedback: Assumptions

A1: The user has sufficient knowledge to formulate the initial query.

A2: Relevance prototypes are well-behaved:
- Term distribution in relevant documents is similar.
- Term distribution in non-relevant documents differs from that in relevant documents.
- Either all relevant documents are tightly clustered around a single prototype, or there are different prototypes with significant vocabulary overlap.
- Similarities between relevant and irrelevant documents are small.

Violation of A1

The user does not have sufficient initial knowledge. Examples:
- Misspellings: Brittany Speers (wrong) / Britney Spears (correct)
- Cross-language information retrieval
- Mismatch between the searcher's vocabulary and the collection vocabulary: hotel / inn / tavern

Violation of A2

There are several relevance prototypes. Examples:
- Burma/Myanmar
- Pop stars that worked at Burger King

Relevance Feedback: Problems

Long queries are inefficient for a typical IR engine:
- long response times for the user
- high cost for the retrieval system
- partial solution: only reweight certain prominent terms, e.g. the top 20 by term frequency

Why is relevance feedback not used more widely?
- Users are often reluctant to provide explicit feedback.
- It is often harder to understand why a particular document was retrieved after applying relevance feedback.

Evaluation of Relevance Feedback Strategies

Note that the improvements brought by relevance feedback decrease with the number of iterations; usually one round gives good results.

Several evaluation strategies:

(a) Comparative evaluation: compare the precision/recall graph of query q_0 with that of query q_m. Usually around +50% in mean average precision, which partly comes from the fact that the known relevant documents are ranked higher.

Evaluation of Relevance Feedback Strategies (continued)

(b) Residual collection: same technique as above, but evaluated on the set of retrieved documents minus the set of assessed relevant documents. The performance measure drops.

(c) Using two similar collections: collection #1 is used for querying and giving relevance feedback; collection #2 is used for comparative evaluation, where q_0 and q_m are compared.
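The residual-collection idea in (b) can be sketched as follows. The helper name and the choice of precision@k as the performance measure are illustrative assumptions.

```python
def residual_precision_at_k(ranking, assessed_relevant, all_relevant, k=10):
    """Residual-collection evaluation: remove the documents already
    assessed as relevant during feedback, then measure precision@k
    on the remaining ranking."""
    residual = [d for d in ranking if d not in assessed_relevant]
    top = residual[:k]
    return sum(d in all_relevant for d in top) / max(len(top), 1)

# "d1" was shown to the user during feedback and judged relevant,
# so it is excluded before the revised ranking is scored
p = residual_precision_at_k(["d1", "d2", "d3", "d4"],
                            assessed_relevant={"d1"},
                            all_relevant={"d1", "d3"}, k=2)
```

Because the easiest documents (the assessed ones) are removed, scores computed this way are systematically lower, which matches the slide's remark that the performance measure drops.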

Evaluation of Relevance Feedback Strategies (continued)

(d) User studies, e.g. time-based comparison of retrieval, user satisfaction, etc. User utility is a fair evaluation, as it corresponds to real system usage.

Other Local Methods for Query Expansion

Pseudo Relevance Feedback

Also known as blind relevance feedback. No extended interaction between the user and the system is needed.

Method:
1. Run a normal retrieval to find an initial set of most relevant documents.
2. Assume that the top k documents are relevant.
3. Apply relevance feedback accordingly.

Works well on the TREC Ad Hoc task with lnc.ltc weighting (precision at k = 50): 62.5% without RF vs. 72.7% with RF.

Problem: the distribution of the documents may influence the results.
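The three steps above can be sketched as a small wrapper around any ranked-retrieval function. Here `retrieve` is a hypothetical callback returning document vectors ranked by similarity to the query; it stands in for whatever retrieval system is in use.

```python
def pseudo_relevance_feedback(query_vec, retrieve, k=10, alpha=1.0, beta=0.75):
    """Blind feedback: assume the top-k results of the initial retrieval
    are relevant and fold their centroid into the query (a Rocchio step
    with no non-relevant component, since nothing is judged non-relevant)."""
    top_k = retrieve(query_vec)[:k]          # step 1: initial ranked retrieval
    if not top_k:
        return list(query_vec)
    centroid = [sum(doc[i] for doc in top_k) / len(top_k)   # step 2
                for i in range(len(query_vec))]
    return [alpha * q + beta * c             # step 3: Rocchio-style update
            for q, c in zip(query_vec, centroid)]

# toy ranked retrieval over a two-term vocabulary
expanded = pseudo_relevance_feedback([1.0, 0.0],
                                     retrieve=lambda q: [[1.0, 0.0], [0.0, 1.0]],
                                     k=2)
```

If the top-k set is dominated by one sense of an ambiguous query, the expanded query drifts toward that sense, which is one way the distribution of the documents can influence the results.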

Indirect Relevance Feedback

Uses indirect evidence rather than explicit feedback. Example: the number of clicks on a given retrieved document. Not user-specific. More suitable for web IR, since it does not require an extra action from the user.

Global Methods for Query Expansion

Vocabulary tools for query reformulation. Tools displaying:
- a list of close terms belonging to the dictionary
- information about the query words that were omitted (cf. stop list)
- the results of stemming
- a debugging environment

Query Logs and Thesauri

Users select among query suggestions that are built either from query logs or from a thesaurus. Replacement words are extracted from the thesaurus according to their proximity to the initial query word.

A thesaurus can be developed:
- manually (e.g. in biomedicine)
- automatically (cf. below)

NB: query expansion (i) increases recall and (ii) may require the user's relevance judgments on query terms rather than on documents.

Automatic Thesaurus Generation

Analysis of the collection to build the thesaurus automatically:
1. Using word co-occurrences (co-occurring words are more likely to belong to the same query field); may contain false positives (example: apple).
2. Using a shallow grammatical analysis to find relations between words; example: cooked, eaten, digested → food.

Note that co-occurrence-based thesauri are more robust, but grammatical-analysis thesauri are more accurate.

Co-occurrence Thesaurus

The simplest way to compute one is based on term-term similarities in $C = AA^T$, where $A$ is the $M \times N$ term-document matrix and $w_{i,j}$ is the (normalized) weight for term $t_i$ in document $d_j$. For each term $t_i$, pick the terms with the highest values in row $i$ of $C$.
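A minimal NumPy sketch of the C = AAᵀ construction. The toy term-document matrix and the term names are made up for illustration; rows are length-normalised so that C holds cosine-like term-term similarities.

```python
import numpy as np

terms = ["ship", "boat", "ocean", "wood"]
# hypothetical M x N term-document matrix A (4 terms, 3 documents)
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

A_norm = A / np.linalg.norm(A, axis=1, keepdims=True)  # normalise term rows
C = A_norm @ A_norm.T            # C[i, j] = similarity of terms t_i and t_j

def neighbours(i, n=1):
    # terms with the highest values in row i of C, excluding t_i itself
    order = np.argsort(-C[i])
    return [terms[j] for j in order if j != i][:n]
```

Since C is symmetric with ones on the diagonal, only the off-diagonal entries of each row matter when picking a term's nearest neighbours.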

Automatic Thesaurus Generation: Example

Conclusion

Query expansion can use either local methods:
- the Rocchio algorithm for relevance feedback
- pseudo relevance feedback
- indirect relevance feedback

or global ones:
- query logs
- thesauri

Thesaurus-based query expansion increases recall but may decrease precision (cf. ambiguous terms), and the development and maintenance of a thesaurus are costly. Thesaurus-based query expansion is less effective than Rocchio relevance feedback, but may be as good as pseudo relevance feedback.

References

- C. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval. http://nlp.stanford.edu/ir-book/pdf/chapter09-queryexpansion.pdf
- Chris Buckley, Gerard Salton and James Allan, The effect of adding relevance information in a relevance feedback environment (1994). http://citeseer.ist.psu.edu/buckley94effect.html
- Ian Ruthven and Mounia Lalmas, A survey on the use of relevance feedback for information access systems (2003). http://personal.cis.strath.ac.uk/~ir/papers/ker.pdf