Relevance Feedback & Other Query Expansion Techniques (Thesaurus, Semantic Network)
(COSC 416) Nazli Goharian, nazli@cs.georgetown.edu
Slides are mostly based on Information Retrieval: Algorithms and Heuristics, Grossman, Frieder.

Relevance Feedback
The modification of the search process to improve the effectiveness of an IR system; it incorporates information obtained from prior relevance judgments. The basic idea is to run an initial query, get feedback from the user (or automatically) as to which documents are relevant, and then add terms from the known relevant document(s) to the query.

Relevance Feedback Example
Query Q: tunnel under English Channel
[Figure: Venn diagram of the document collection showing the documents retrieved, the relevant documents, and their overlap (relevant documents retrieved); some retrieved documents are not relevant.]
Top-ranked document: "The tunnel under the English Channel is often called a Chunnel."
Expanded query Q1: tunnel under English Channel Chunnel
[Figure: after running Q1, the overlap between the documents retrieved and the relevant documents grows.]

Feedback Mechanisms
Automatic (pseudo/blind): the good terms from the good, top-ranked documents are selected by the system and added to the user's query.
Semi-automatic: the user provides feedback as to which documents are relevant (via a clicked document or by selecting a set of documents); the good terms from those documents are added to the query. Similarly, terms can be shown to the user to pick from.
Suggesting new queries to the user based on: the query log; the clicked document (generally limited to one document).

Pseudo Relevance Feedback Algorithm
- Identify the good (N top-ranked) documents.
- Identify all terms from the N top-ranked documents.
- Select the good (T top) feedback terms.
- Merge the feedback terms with the original query.
- Identify the top-ranked documents for the modified query through relevance ranking.

Sort Criteria
Methods to select the good terms:
- n*idf (a reasonable measure)
- f*idf
where:
n: number of documents in the relevant set containing term t
f: frequency of term t in the relevant set
(Goharian, Grossman, Frieder, 2002, 2010)
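
The term-selection step above can be sketched in a few lines. This is an illustrative sketch, not the lecture's code; it assumes the top documents are already tokenized and that idf values are available:

```python
from collections import Counter

def select_feedback_terms(top_docs, idf, T, measure="n*idf"):
    """Pick the T best feedback terms from the top-ranked documents.

    top_docs: list of documents, each a list of terms
    idf: dict mapping term -> idf value
    measure: "n*idf" (document count * idf) or "f*idf" (frequency * idf)
    """
    n = Counter()  # number of top documents containing the term
    f = Counter()  # total frequency of the term in the top documents
    for doc in top_docs:
        f.update(doc)
        n.update(set(doc))
    score = {t: (n[t] if measure == "n*idf" else f[t]) * idf[t] for t in f}
    return sorted(score, key=score.get, reverse=True)[:T]

# The worked example from the next slide: d1, d2, d3 with the given idf values.
docs = [list("ABBCD"), list("CDEEAA"), list("AAA")]
idf = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2}
print(select_feedback_terms(docs, idf, 2))  # n*idf ranks D (4) above A (3)
```

The selected terms would then be appended to the original query before the second retrieval pass.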

Example
Top 3 documents:
d1: A, B, B, C, D
d2: C, D, E, E, A, A
d3: A, A, A
Assume the idf of A, B, C is 1 and of D, E is 2.

Term   n   f   n*idf   f*idf
A      3   6   3       6
B      1   2   1       2
C      2   2   2       2
D      2   2   4       4
E      1   2   2       4

Based on n*idf: top 2 terms: D, A; top 3 terms: D, A, and C or E (tied).

Original Rocchio Vector Space Relevance Feedback [1965]
Step 1: Run the query.
Step 2: Show the user the results.
Step 3: Based on the user feedback: add new terms to the query or increase the query term weights; remove terms or decrease the term weights.
Objective: increase the query accuracy.

Rocchio Vector Space Relevance Feedback

  Q' = α·Q + β·(1/n1)·Σ_{i=1..n1} R_i − γ·(1/n2)·Σ_{i=1..n2} S_i

where:
Q: original query vector
Q': new query vector
R: set of relevant document vectors (n1 = |R|)
S: set of non-relevant document vectors (n2 = |S|)
α, β, γ: constants (Rocchio weights)

Variations in the Vector Model
Starting from Q' = α·Q + β·(1/n1)·Σ R_i − γ·(1/n2)·Σ S_i, options include:
- α = 1, with different choices of β and γ
- α = β = γ = 1
- Use only the first n documents from R and S
- Use only the first document of S
- Do not use S (γ = 0)
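
As a concrete sketch of the Rocchio update, here is an illustrative implementation over dense term-weight vectors; the toy vocabulary and clipping of negative weights to zero are my own assumptions, not part of the slides:

```python
def rocchio(q, relevant, non_relevant, alpha=1.0, beta=1.0, gamma=1.0):
    """Rocchio update: Q' = alpha*Q + (beta/|R|)*sum(R) - (gamma/|S|)*sum(S).

    All vectors are plain lists of term weights over the same vocabulary.
    Negative weights are clipped to zero, a common practical choice.
    """
    dims = len(q)

    def centroid(vectors):
        if not vectors:
            return [0.0] * dims
        return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]

    r = centroid(relevant)       # mean of relevant document vectors
    s = centroid(non_relevant)   # mean of non-relevant document vectors
    return [max(0.0, alpha * q[d] + beta * r[d] - gamma * s[d])
            for d in range(dims)]

# Toy vocabulary [tunnel, channel, chunnel]; one relevant document adds
# weight on "chunnel", pulling the query toward it.
q = [1.0, 1.0, 0.0]
print(rocchio(q, [[1.0, 1.0, 1.0]], []))  # [2.0, 2.0, 1.0]
```

With α = 1 and no non-relevant documents, the update simply adds the relevant centroid to the query, which is the "add terms / increase weights" behavior described above.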

Implementing Relevance Feedback
First obtain the top documents; do this with the usual inverted index. Then we need the top terms from the top X documents. Two choices:
- Retrieve the top X documents and scan them in memory for the top terms.
- Use a separate doc-term structure that contains, for each document, the terms that document contains.

Relevance Feedback in the Probabilistic Model
Needs training data for R and r (unlikely to be available). Some other strategy, such as the VSM, can be used for the initial pass to get the top n documents; the relevant documents R are then estimated as the relevant documents found in the top n, and r is estimated from these documents. The query can be expanded using the probabilistic model term weighting.
Options: re-weighting the initial query terms; adding new terms with or without re-weighting the initial query terms.
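
The second choice above, a separate doc-term structure (a "forward index"), can be built alongside the inverted index. This is an illustrative sketch with hypothetical names, not the lecture's implementation:

```python
def build_indexes(docs):
    """Build an inverted index (term -> doc ids) and a forward index
    (doc id -> terms), so feedback terms can be looked up per document
    without re-scanning the raw document text."""
    inverted, forward = {}, {}
    for doc_id, terms in docs.items():
        forward[doc_id] = list(terms)          # doc-term structure
        for t in set(terms):
            inverted.setdefault(t, []).append(doc_id)
    return inverted, forward

docs = {1: ["tunnel", "english", "channel", "chunnel"],
        2: ["channel", "news"]}
inverted, forward = build_indexes(docs)
print(inverted["channel"])  # doc ids containing the term: [1, 2]
print(forward[1])           # terms of a top-ranked document, for feedback
```

The trade-off noted on the slide is space versus time: the forward index doubles the storage of term data but avoids fetching and parsing documents during feedback.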

Pseudo Relevance Feedback in the Language Model
(from Manning, based on Victor Lavrenko and ChengXiang Zhai)
Each document D induces a document language model θ_D, and the query Q a query model θ_Q; results are ranked by the distance between θ_D and θ_Q. Given the feedback documents F = {d_1, d_2, ..., d_n}, a feedback model θ_F is estimated and interpolated with the query model:

  θ_Q' = (1 − α)·θ_Q + α·θ_F

α = 0: no feedback (θ_Q' = θ_Q)
α = 1: full feedback (θ_Q' = θ_F)

Relevance Feedback Modifications
Various techniques can be used to improve the relevance feedback process:
- Number of top-ranked documents
- Number of feedback terms
- Feedback term selection techniques
- Iterations
- Term weighting
- Phrase versus single term
- Document clustering
- Relevance feedback thresholding
- Term frequency cutoff points
- Query expansion using a thesaurus
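
The interpolation θ_Q' = (1 − α)·θ_Q + α·θ_F is just a per-term mixture of two probability distributions. A minimal sketch, assuming simple maximum-likelihood unigram models (the estimation of θ_F in practice is more involved, e.g. relevance models):

```python
from collections import Counter

def unigram_lm(terms):
    """Maximum-likelihood unigram model: term -> probability."""
    counts = Counter(terms)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def interpolate(theta_q, theta_f, alpha):
    """theta_Q' = (1 - alpha) * theta_Q + alpha * theta_F."""
    vocab = set(theta_q) | set(theta_f)
    return {t: (1 - alpha) * theta_q.get(t, 0.0) + alpha * theta_f.get(t, 0.0)
            for t in vocab}

theta_q = unigram_lm(["tunnel", "channel"])
theta_f = unigram_lm(["channel", "chunnel", "chunnel", "channel"])
mixed = interpolate(theta_q, theta_f, 0.5)
print(mixed["channel"], mixed["chunnel"])  # 0.5 0.25
```

Note that "chunnel", absent from the original query, receives non-zero probability in θ_Q' for any α > 0; this is the language-model analogue of adding feedback terms to the query.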

Relevance Feedback Justification
[Figure: recall-precision curves showing the improvement from relevance feedback with n*idf weights; "nidf, no feedback" versus "nidf, feedback 10 terms".]

Number of Top-Ranked Documents
[Figure: recall-precision curves for varying numbers of top-ranked documents (1, 5, 10, 20, 30, 50) with 10 feedback terms.]

Number of Feedback Terms
[Figure: recall-precision curves for varying numbers of feedback terms with 20 top-ranked documents; "nidf, no feedback" versus "nidf, feedback 10 terms" and "nidf, feedback 50 words + 20 phrases".]

Summary of Relevance Feedback
Pro: Relevance feedback usually improves average precision by increasing the number of good terms in the query (generally a 10-15% improvement).
Con: More computational work. It is easy to decrease precision (one horrible word can undo the good caused by lots of good words).

Thesauri
It is intuitive to use thesauri to expand a query to enhance accuracy. A query about "dogs" might well be expanded to include "canine" if a thesaurus were consulted.
Problem: a bad word can easily be added. A synonym for "dog" might well be "pet", and then the query would be too generic.

Manual vs. Automatic
Manual: use a readily available machine-readable form of a thesaurus (e.g., Roget's).
Automatic: build a thesaurus automatically in a language-independent fashion. The notion is that an algorithm that can build a thesaurus automatically could be used on many different languages.

Thesaurus Generation with Term Co-occurrence
A thesaurus is generated by finding similar terms: terms that co-occur with each other above a threshold are considered similar. A term-term similarity matrix is created, holding a similarity coefficient (SC) between every pair of terms t_i and t_j.

Term Co-occurrence (example)
Term vectors (term-document mapping):
t_1 = <1 1>
t_2 = <0 1>
SC(t_1, t_2) = <0 1> · <1 1> = 1 (dot product)
SC(t_1, t_2) = SC(t_2, t_1) (symmetric coefficient)
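
The term-term similarity matrix can be computed directly from binary term-document incidence vectors; a small sketch using the dot product as the similarity coefficient (SC), reproducing the two-document example above:

```python
def term_doc_vectors(docs):
    """Binary term-document incidence vectors: term -> [0/1 per document]."""
    vocab = sorted({t for d in docs for t in d})
    return {t: [1 if t in d else 0 for d in docs] for t in vocab}

def cooccurrence_matrix(vectors):
    """SC(t_i, t_j) = dot product of incidence vectors (symmetric)."""
    terms = list(vectors)
    return {(a, b): sum(x * y for x, y in zip(vectors[a], vectors[b]))
            for a in terms for b in terms}

# The slide's example: t1 occurs in both documents, t2 only in the second.
docs = [{"t1"}, {"t1", "t2"}]
sc = cooccurrence_matrix(term_doc_vectors(docs))
print(sc[("t1", "t2")], sc[("t2", "t1")])  # 1 1 (symmetric)
```

In practice the raw dot product would be normalized (e.g., cosine) so that very frequent terms do not dominate, which is exactly the problem noted on the next slide.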

Expanding a Query Using Term Co-occurrence
For a given term t_i, the top t similar terms are picked. These words can now be used for query expansion.
Problems:
- A very frequent term will co-occur with everything.
- Very general terms will co-occur with other general terms ("hairy" will co-occur with "furry").

Semantic Networks
Attempt to resolve the mismatch problem. Instead of matching query terms and document terms, measure the semantic distance. Premise: terms that share the same meaning are closer (smaller distance) to each other in the semantic network.
See the publicly available tool, WordNet (www.cogsci.princeton.edu/~wn).

Semantic Networks
Build a network that, for each word, shows its relationships to other words (recent efforts incorporate phrases). For "dog" and "canine" a synonym arc would exist. To expand a query, find the word in the semantic network and follow the various arcs to other related words. Different distance measures can be used to compute the distance from one word in the network to another.

WordNet
[Figure: an excerpt of the WordNet network; based on the Word Sense Disambiguation survey by R. Navigli, ACM Computing Surveys, 2009.]
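
One simple distance measure over such a network is the length of the shortest path along the arcs. A tiny sketch using breadth-first search over a hand-made word graph (the graph here is illustrative, not actual WordNet data):

```python
from collections import deque

def semantic_distance(graph, src, dst):
    """Shortest-path (edge count) distance between two words in a
    semantic network; -1 if no path exists."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        word, dist = queue.popleft()
        if word == dst:
            return dist
        for nxt in graph.get(word, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return -1

# Hand-made network: synonym and is-a arcs, stored as adjacency lists.
graph = {"dog": ["canine", "mammal"],
         "canine": ["dog"],
         "mammal": ["dog", "cat"],
         "cat": ["mammal"]}
print(semantic_distance(graph, "dog", "canine"))  # 1 (direct synonym arc)
print(semantic_distance(graph, "canine", "cat"))  # 3 (canine -> dog -> mammal -> cat)
```

For query expansion, one would keep only words within a small distance of the query term, since each extra arc generally weakens the semantic relationship.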

WordNet (an excerpt of the domain labels taxonomy)
[Figure: based on the Word Sense Disambiguation survey by R. Navigli, ACM Computing Surveys, 2009.]

Types of Links in WordNet
- Synonyms: dog, canine
- Antonyms (opposite): night, day
- Hyponyms (is-a): dog, mammal
- Meronyms (part-of): roof, house
- Entailment (one entails the other): buy, pay
- Troponyms (two words related by entailment that must occur at the same time): limp, walk

Summary
Query expansion techniques such as relevance feedback, thesauri, and WordNet (a semantic network) can be used to find hopefully good words for users. Using user intervention for the feedback improves the results.