
Optimal Query

Assume that the relevant set of documents C_r is known. Then the best query is:

$$\vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j \;-\; \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j$$

where N is the total number of documents. Note that even this query may return some irrelevant documents if the relevant and irrelevant documents are not linearly separable.
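As a rough illustration, the optimal query is just the centroid of the relevant documents minus the centroid of everything else. The snippet below is a minimal numpy sketch, assuming documents are already represented as rows of a term-weight matrix; the function and argument names are illustrative, not from the slides.

```python
import numpy as np

def optimal_query(doc_vectors, relevant_mask):
    """q_opt: centroid of relevant docs minus centroid of the remaining docs.

    doc_vectors:   (N, V) array of document term-weight vectors (e.g. tf-idf).
    relevant_mask: boolean array of length N marking the relevant set C_r.
    """
    relevant_mask = np.asarray(relevant_mask, dtype=bool)
    rel = doc_vectors[relevant_mask]       # the |C_r| relevant documents
    nonrel = doc_vectors[~relevant_mask]   # the remaining N - |C_r| documents
    return rel.mean(axis=0) - nonrel.mean(axis=0)
```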


Relevance Feedback

Hypothesis: the initial query only approximates the information need (i.e., the optimal query). So:

1) Get an initial query.
2) Retrieve documents according to the query.
3) Get an indication of which retrieved documents are relevant or irrelevant.
4) Reformulate the query.
5) Go to step 2.

Assumes relevance judgements are binary. Other things?
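A heavily simplified sketch of this loop; `retrieve`, `judge`, and `reformulate` are caller-supplied hooks standing in for the IR system, the user's binary judgements, and a reformulation rule such as Rocchio (next slides), so nothing here assumes a particular implementation.

```python
def relevance_feedback_loop(initial_query, retrieve, judge, reformulate,
                            rounds=3, k=10):
    """Iteratively refine a query from binary relevance judgements.

    retrieve(query, k)       -> ranked list of documents        (step 2)
    judge(ranked)            -> (relevant, irrelevant) lists    (step 3)
    reformulate(q, rel, irr) -> revised query, e.g. via Rocchio (step 4)
    """
    query = initial_query                                 # step 1
    for _ in range(rounds):                               # step 5: loop back
        ranked = retrieve(query, k)                       # step 2
        relevant, irrelevant = judge(ranked)              # step 3
        query = reformulate(query, relevant, irrelevant)  # step 4
    return query
```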

Relevance Feedback Architecture

[Diagram: a query string is submitted to the IR system over the document corpus, producing ranked documents; the user's relevance feedback on those rankings feeds query reformulation, and the revised query is run again to produce re-ranked documents.]

Standard Rocchio Method

Since the full set of relevant documents is unknown, just use the known relevant (D_r) and irrelevant (D_n) sets of documents, and include the initial query q:

$$\vec{q}_m = \alpha \vec{q} + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j$$

α: tunable weight for the initial query.
β: tunable weight for relevant documents.
γ: tunable weight for irrelevant documents.
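A minimal numpy sketch of this update, assuming the query and documents are already term-weight vectors of the same length; the function name is illustrative, and the default weights of 1 follow the later "Comparison of Methods" slide.

```python
import numpy as np

def rocchio(query, relevant_docs, irrelevant_docs,
            alpha=1.0, beta=1.0, gamma=1.0):
    """Standard Rocchio: q_m = alpha*q + beta*centroid(D_r) - gamma*centroid(D_n)."""
    q_m = alpha * np.asarray(query, dtype=float)
    if len(relevant_docs):
        q_m = q_m + beta * np.asarray(relevant_docs).mean(axis=0)
    if len(irrelevant_docs):
        q_m = q_m - gamma * np.asarray(irrelevant_docs).mean(axis=0)
    return q_m
```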

Improving Rocchio: Ide Regular Method

Idea: since more feedback should perhaps increase the degree of reformulation, do not normalize for the amount of feedback:

$$\vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \sum_{\vec{d}_j \in D_n} \vec{d}_j$$

α: tunable weight for the initial query.
β: tunable weight for relevant documents.
γ: tunable weight for irrelevant documents.

More Improvements: Ide dec-hi Method

Bias towards rejecting just the highest-ranked of the irrelevant documents:

$$\vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \max_{\text{non-relevant}}(\vec{d}_j)$$

α: tunable weight for the initial query.
β: tunable weight for relevant documents.
γ: tunable weight for irrelevant documents.

How is this like query zoning from Schapire's paper?
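A sketch covering both Ide variants, under the same vector-representation assumptions as the Rocchio sketch above; it assumes the feedback sets are non-empty, and "max over non-relevant" is taken to mean the single highest-ranked judged-irrelevant document, which the caller passes in.

```python
import numpy as np

def ide_regular(query, relevant_docs, irrelevant_docs,
                alpha=1.0, beta=1.0, gamma=1.0):
    """Ide regular: Rocchio without normalizing by the amount of feedback."""
    return (alpha * np.asarray(query, dtype=float)
            + beta * np.asarray(relevant_docs).sum(axis=0)
            - gamma * np.asarray(irrelevant_docs).sum(axis=0))

def ide_dec_hi(query, relevant_docs, top_irrelevant_doc,
               alpha=1.0, beta=1.0, gamma=1.0):
    """Ide dec-hi: subtract only the highest-ranked judged-irrelevant document."""
    return (alpha * np.asarray(query, dtype=float)
            + beta * np.asarray(relevant_docs).sum(axis=0)
            - gamma * np.asarray(top_irrelevant_doc, dtype=float))
```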

Yet More Improvements: Schapire's Three

Improved term weighting: replace standard tf-idf weights with weights based on a probabilistic model of word usage in context. Words get credit for interesting usage, not simply for appearing. (Stop words are rarely used in interesting ways.)

Query zoning: the idea is to use only those examples marked as irrelevant that are close to the relevant documents, i.e., near misses.

Dynamic feedback optimization: change term weights using feedback, so the values of words change with feedback.
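For the query-zoning idea, a minimal sketch of my own (not Schapire's implementation): keep only the judged-irrelevant documents most similar to the centroid of the relevant ones, i.e., the near misses.

```python
import numpy as np

def query_zone(relevant_docs, irrelevant_docs, keep=50):
    """Select 'near miss' irrelevant documents for use in feedback.

    Ranks each judged-irrelevant document by cosine similarity to the
    centroid of the relevant documents and keeps the top `keep`.
    """
    rel = np.asarray(relevant_docs, dtype=float)
    irr = np.asarray(irrelevant_docs, dtype=float)
    centroid = rel.mean(axis=0)
    sims = irr @ centroid / (
        np.linalg.norm(irr, axis=1) * np.linalg.norm(centroid) + 1e-12)
    order = np.argsort(-sims)          # most similar (nearest misses) first
    return irr[order[:keep]]
```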


Schapire's Weak Learners

Derive a rule of thumb: a single word or phrase that can be used with some reliability to split "yes" (relevant) from "no" (irrelevant). For example, if trying to sort CS pages from non-CS pages, a weak hypothesis might be the presence of the word "computer".

AdaBoost
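A minimal sketch of discrete AdaBoost using the single-word "rules of thumb" from the previous slide as weak learners; the binary word-presence encoding, the default number of rounds, and all names are illustrative assumptions rather than the exact formulation in Schapire's paper.

```python
import numpy as np

def adaboost_word_stumps(X, y, rounds=20):
    """Discrete AdaBoost with word-presence stumps as weak learners.

    X: (n_docs, n_words) binary word-presence matrix (illustrative encoding).
    y: labels in {-1, +1} (e.g. +1 = relevant / CS page, -1 = otherwise).
    Each weak hypothesis predicts +1 when a chosen word is present (or when
    it is absent, if that flipped rule has lower weighted error).
    Returns a list of (word_index, polarity, alpha) triples.
    """
    X, y = np.asarray(X), np.asarray(y)
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                  # example weights
    preds = np.where(X == 1, 1, -1)          # stump "+1 if word present"
    ensemble = []
    for _ in range(rounds):
        errs = (w[:, None] * (preds != y[:, None])).sum(axis=0)
        j = int(np.argmin(np.minimum(errs, 1 - errs)))   # best word either way
        polarity = 1 if errs[j] <= 1 - errs[j] else -1   # flip rule if better
        err = float(np.clip(min(errs[j], 1 - errs[j]), 1e-10, 1 - 1e-10))
        alpha = 0.5 * np.log((1 - err) / err)            # weak learner weight
        h = polarity * preds[:, j]
        w = w * np.exp(-alpha * y * h)                   # up-weight mistakes
        w /= w.sum()
        ensemble.append((j, polarity, alpha))
    return ensemble

def adaboost_score(x, ensemble):
    """Final hypothesis: alpha-weighted vote of the chosen word stumps."""
    return sum(alpha * polarity * (1 if x[j] == 1 else -1)
               for j, polarity, alpha in ensemble)
```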

Comparison of Methods

Overall, experimental results indicate no clear preference for any one of the specific methods. All methods generally improve retrieval performance (recall and precision) with feedback. In practice, the tunable constants are generally just set equal to 1.

Comparing Rocchio+ to AdaBoost

Results are about the same. Both score a document as a weighted sum:

AdaBoost: $\sum_{s=1}^{T} \alpha_s h_s(\vec{d})$    Rocchio: $\sum_{s} h_s d_s$

where h_s is a weak-learner hypothesis for AdaBoost and a component of the Rocchio query vector (in vector space) for Rocchio. Importantly, both methods learn a single separating hyperplane. Rocchio will tend to have lots of small, nonzero terms, whereas AdaBoost will have few nonzero terms. But they learn a fundamentally similar function, so it is not surprising that they are about equally effective.

Evaluating Relevance Feedback

By construction, the reformulated query will rank explicitly-marked relevant documents higher and explicitly-marked irrelevant documents lower. The method should not get credit for improvement on these documents, since it was told their relevance. In machine learning, this error is called testing on the training data. Evaluation should focus on generalizing to other, un-rated documents.


Fair Evaluation of Relevance Feedback

Remove from the corpus any documents for which feedback was provided. Measure recall/precision performance on the remaining residual collection. Compared to the complete corpus, specific recall/precision numbers may decrease, since relevant documents were removed. However, relative performance on the residual collection provides fair data on the effectiveness of relevance feedback.
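A small sketch of the residual-collection measurement, assuming the ranking is a list of document ids and relevance judgements are sets of ids (all names are illustrative):

```python
def residual_precision_at_k(ranking, relevant_ids, feedback_ids, k=10):
    """Precision@k on the residual collection.

    ranking:      ranked doc ids returned for the reformulated query.
    relevant_ids: set of all doc ids judged relevant to the topic.
    feedback_ids: set of doc ids rated during feedback; they are removed so
                  the method gets no credit for re-ranking them.
    """
    residual = [d for d in ranking if d not in feedback_ids]
    top = residual[:k]
    return sum(1 for d in top if d in relevant_ids) / max(len(top), 1)
```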

Why Is Feedback Not Widely Used?

Users are sometimes reluctant to provide explicit feedback. It results in long queries that require more computation to retrieve, and search engines process lots of queries and allow little time for each one. It also makes it harder to understand why a particular document was retrieved.

Pseudo Feedback

Use relevance feedback methods without explicit user input. Just assume the top m retrieved documents are relevant, and use them to reformulate the query. Allows for query expansion that includes terms that are correlated with the query terms.
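A minimal sketch of pseudo (blind) feedback built on the Rocchio sketch above; `retrieve` and `doc_vector` are caller-supplied hooks mapping a query vector to a ranked list of doc ids and a doc id to its term-weight vector, so both are assumptions rather than a specific system's API.

```python
def pseudo_feedback(query_vec, retrieve, doc_vector, m=10, k=1000):
    """Blind relevance feedback: treat the top m retrieved docs as relevant.

    retrieve(query_vec, k) -> ranked list of doc ids.
    doc_vector(doc_id)     -> term-weight vector for that document.
    """
    ranked = retrieve(query_vec, k)
    assumed_relevant = [doc_vector(d) for d in ranked[:m]]
    # Reuse the Rocchio update with no judged-irrelevant documents.
    return rocchio(query_vec, assumed_relevant, irrelevant_docs=[])
```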

Pseudo Feedback Architecture

[Diagram: the same pipeline as the relevance feedback architecture, except that the feedback on the ranked documents comes from the top-ranked results themselves (pseudo feedback) rather than from the user; the revised query then produces re-ranked documents.]

Pseudo Feedback Results

Found to improve performance on the ad-hoc retrieval task of the TREC competition. Works even better if the top documents must also satisfy additional Boolean constraints in order to be used in feedback.

Thesaurus-based Query Expansion

For each term t in a query, expand the query with synonyms and related words of t from the thesaurus. Added terms may be weighted less than the original query terms. Generally increases recall, but may significantly decrease precision, particularly with ambiguous terms:

"interest rate" → "interest rate fascinate evaluate"
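A toy sketch of this expansion with a hand-built thesaurus and down-weighted added terms; the tiny thesaurus and the weight of 0.5 are illustrative, and the example deliberately reproduces the "interest rate" ambiguity above.

```python
# Toy thesaurus; a real system would use a curated resource or WordNet.
THESAURUS = {
    "interest": ["fascinate", "pastime"],
    "rate": ["evaluate", "pace"],
}

def expand_query(terms, added_weight=0.5):
    """Return {term: weight}, adding thesaurus terms at a lower weight."""
    weights = {t: 1.0 for t in terms}
    for t in terms:
        for syn in THESAURUS.get(t, []):
            weights.setdefault(syn, added_weight)
    return weights

print(expand_query(["interest", "rate"]))
# {'interest': 1.0, 'rate': 1.0, 'fascinate': 0.5, 'pastime': 0.5,
#  'evaluate': 0.5, 'pace': 0.5}
```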

WordNet

A more detailed database of semantic relationships between English words. About 144,000 English words. Nouns, adjectives, verbs, and adverbs are grouped into about 109,000 synonym sets called synsets.

http://www.cogsci.princeton.edu/cgibin/webwn

WordNet Synset Relationships

Antonym: front - back
Attribute: benevolence - good (noun to adjective)
Pertainym: alphabetical - alphabet (adjective to noun)
Similar: unquestioning - absolute
Cause: kill - die
Entailment: breathe - inhale
Holonym: chapter - text (part-of)
Meronym: computer - cpu (whole-of)
Hyponym: tree - plant (specialization)
Hypernym: fruit - apple (generalization)

WordNet Query Expansion

Add synonyms in the same synset. Add hyponyms to add specialized terms. Add hypernyms to generalize a query. Add other related terms to expand the query.
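A short sketch using NLTK's WordNet interface to collect candidate expansion terms; it requires the WordNet data to be downloaded (nltk.download('wordnet')), and, like any automatic expansion, it picks up terms from every sense of an ambiguous word unless synsets are selected by hand.

```python
from nltk.corpus import wordnet as wn   # needs: nltk.download('wordnet')

def wordnet_expansions(term):
    """Collect synonyms, hyponyms, and hypernyms of a term from WordNet."""
    synonyms, hyponyms, hypernyms = set(), set(), set()
    for synset in wn.synsets(term):
        synonyms.update(l.name() for l in synset.lemmas())   # same synset
        for h in synset.hyponyms():                          # specializations
            hyponyms.update(l.name() for l in h.lemmas())
        for h in synset.hypernyms():                         # generalizations
            hypernyms.update(l.name() for l in h.lemmas())
    return synonyms, hyponyms, hypernyms

syn, hypo, hyper = wordnet_expansions("tree")
# e.g. the hypernyms of "tree" include lemmas such as "woody_plant".
```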

WordNet Expansion Results

From Voorhees (1994): the effectiveness of hand-selected WordNet synsets for the precision and recall of 50 TREC queries.

Query Expansion Conclusions

Expansion of queries with related terms can improve performance, particularly recall. However, similar terms must be selected very carefully to avoid problems such as loss of precision.