Outline. Possible solutions. The basic problem. How? How? Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity

Similar documents
Relevance Feedback and Query Reformulation. Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price. Outline

Lecture 7: Relevance Feedback and Query Expansion

Modern Information Retrieval

Chapter 6: Information Retrieval and Web Search. An introduction

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

Information Retrieval and Web Search

Optimal Query. Assume that the relevant set of documents C r. 1 N C r d j. d j. Where N is the total number of documents.

Information Retrieval

Information Retrieval

Chapter 2. Architecture of a Search Engine

CSCI 599: Applications of Natural Language Processing Information Retrieval Retrieval Models (Part 1)"

VK Multimedia Information Systems

Search Engine Architecture II

Query Refinement and Search Result Presentation

Information Retrieval. Techniques for Relevance Feedback

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

Introduction to Information Retrieval

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!

Natural Language Processing

CS 6320 Natural Language Processing

Chapter 27 Introduction to Information Retrieval and Web Search

INFSCI 2140 Information Storage and Retrieval Lecture 6: Taking User into Account. Ad-hoc IR in text-oriented DS

Introduction to Information Retrieval

Information Retrieval. (M&S Ch 15)

Information Retrieval. Information Retrieval and Web Search

Refining searches. Refine initially: query. Refining after search. Explicit user feedback. Explicit user feedback

Relevance Feedback & Other Query Expansion Techniques

IMPROVING THE RELEVANCY OF DOCUMENT SEARCH USING THE MULTI-TERM ADJACENCY KEYWORD-ORDER MODEL

Document Clustering: Comparison of Similarity Measures

Mining Web Data. Lijun Zhang

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

Modern Information Retrieval

Vannevar Bush. Information Retrieval. Prophetic: Hypertext. Historic Vision 2/8/17

Boolean Model. Hongning Wang

Text Analytics (Text Mining)

Information Retrieval and Web Search

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

Web Search Engine Question Answering

CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL

Mining Web Data. Lijun Zhang

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

CS54701: Information Retrieval

Birkbeck (University of London)

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s

Text Analytics (Text Mining)

Semantic Search in s

Automatic Document; Retrieval Systems. The conventional library classifies documents by numeric subject codes which are assigned manually (Dewey

Information Retrieval. hussein suleman uct cs

- Content-based Recommendation -

Data Modelling and Multimedia Databases M

Query reformulation CE-324: Modern Information Retrieval Sharif University of Technology

COMP6237 Data Mining Searching and Ranking

Information Retrieval: Retrieval Models

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

Information Retrieval

vector space retrieval many slides courtesy James Amherst

THE WEB SEARCH ENGINE

Efficient query processing

A Document Graph Based Query Focused Multi- Document Summarizer

CS47300: Web Information Search and Management

modern database systems lecture 4 : information retrieval

Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

A New Approach for Automatic Thesaurus Construction and Query Expansion for Document Retrieval

Clustering. Bruno Martins. 1 st Semester 2012/2013

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University

Content-based Recommender Systems

Tag-based Social Interest Discovery

Instructor: Stefan Savev

Query Suggestion. A variety of automatic or semi-automatic query suggestion techniques have been developed

A Survey on Postive and Unlabelled Learning

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

Stanford University Computer Science Department Solved CS347 Spring 2001 Mid-term.

Feature selection. LING 572 Fei Xia

Document Expansion for Text-based Image Retrieval at CLEF 2009

Visualization and text mining of patent and non-patent data

Indexing in Search Engines based on Pipelining Architecture using Single Link HAC

Information Retrieval and Data Mining Part 1 Information Retrieval

Knowledge Discovery and Data Mining 1 (VO) ( )

Query Operations. Relevance Feedback Query Expansion Query interpretation

CS371R: Final Exam Dec. 18, 2017

ΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου

Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _

Collaborative Filtering and Recommender Systems. Definitions. .. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar..

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information retrieval

NATURAL LANGUAGE PROCESSING

Clustering Results. Result List Example. Clustering Results. Information Retrieval

TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback

Contextual Information Retrieval Using Ontology-Based User Profiles

How do microarrays work

Query Reformulation for Clinical Decision Support Search

Clustering (COSC 488) Nazli Goharian. Document Clustering.

Department of Electronic Engineering FINAL YEAR PROJECT REPORT

Retrieval by Content. Part 3: Text Retrieval Latent Semantic Indexing. Srihari: CSE 626 1

Link Recommendation Method Based on Web Content and Usage Mining

Using Coherence-based Measures to Predict Query Difficulty

Clustering (COSC 416) Nazli Goharian. Document Clustering.

Information Retrieval

CS Introduction to Data Mining Instructor: Abdullah Mueen

Transcription:

Outline Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity Lecture 10 CS 410/510 Information Retrieval on the Internet Query reformulation Sources of relevance for feedback Using relevance feedback Vector model Probabilistic model Clustering approaches Local analysis Global analysis Examples of relevance feedback and query expansion Inputs to document ranking other than similarity CS 510 Winter 2007 2 The basic problem Query Possible solutions Retrieve new documents Results Queries are not always successful New queries do not learn from previous failure CS 510 Winter 2007 3 Add new terms to the query (query expansion) Re-order the documents Re-weight terms CS 510 Winter 2007 4 How? Where do we get information for adding query terms or re-weighting documents? Use relevant documents to help find more relevant documents Assumption: Relevant documents are more similar to each other than they are to non-relevant documents How? How do we get relevant documents? 1. Ask the user Systems returns documents for initial query User marks which ones are relevant System uses as input to new query Assumption: user will recognize documents that are useful, or close to what he wants 2. Assume top-ranked documents are relevant; automatically reformulate query CS 510 Winter 2007 5 CS 510 Winter 2007 6 1

Relevance feedback: user Blind feedback (3)Relevance feedback (1) Query (2) Results (4) New Result list CS 510 Winter 2007 7 (1) Query (3) Results (2) Term reweighting based on top-ranked documents CS 510 Winter 2007 8 Using relevance feedback Vector model: use terms and term weights in relevant documents to reformulate the query Add new terms (query expansion) Re-weight existing terms Basic approach: Vec newq = Vec oldq + Vec posexamples Vec negexamples Relevance feedback: Vector model Three classic query reformulations: (α, β, γ originally set to 1) D r is the set of relevant docs; D n is the set of non-relevant docs β γ 1. Rocchio qm = α q + d D d r D n Dr Dn Divide positive & negative vectors by size of example sets 2. Ide q = α q + β d γ d m Dr Dn 3. Ide Dec-Hi qm = α q + β d max d D γ r Use only top-ranked non-relevant document non relevant ( d ) CS 510 Winter 2007 9 Vec newq = Vec oldq + Vec posexamples Vec negexamples CS 510 Winter 2007 10 Using relevance feedback Probabilistic model: use relevant documents to re-calculate probabilities of relevance Re-weight terms No query expansion CS 510 Winter 2007 11 Clustering approaches Cluster hypothesis: closely associated documents tend to be relevant to the same requests 1 Relevant to query A Relevant to query B Instead of using weights of individual terms in relevant documents, use relevant documents to describe of cluster of desirable documents 1 C.J vanrisbergen. Information Retrieval, Chapter 3. 1979. CS 510 Winter 2007 12 2

Clustering for query reformulation Local analysis Operate only on documents returned for the current query the local set Use term co-occurrences to select terms for query expansion Global analysis Use information from all the documents in the collection Local Clustering Use term co-occurrence data to calculate correlation between terms u, v: C u,v Add the n query terms with the highest C u,v where u is a term in the original query Reflects term clustering CS 510 Winter 2007 13 CS 510 Winter 2007 14 Local Clustering Association clusters: C u,v reflects frequency of co-occurrence in the same document Metric clusters: C u,v reflects proximity of co-occurrences (number of words between occurrences of u and v) in documents Scalar clusters: Calculate vectors of correlation values for all terms in local documents C u,v reflects similarity of vectors (e.g. cosine of angle between vectors for u and v) Local Context Analysis Use noun groups (one or more adacent nouns in text) not ust keywords: concepts 1. Retrieve top n-ranked passages with original query 2. Calculate similarity between query and each concept (noun group) Reflects co-occurrence of query terms and concepts in passages 3. Add top m-ranked concepts to query, weighted by similarity ranking CS 510 Winter 2007 15 CS 510 Winter 2007 16 Global analysis Similarity thesaurus Statistical thesaurus CS 510 Winter 2007 17 Creating a similarity thesaurus Index a term based on documents it appears in each term associated with a vector k i = ( wi, 1, wi,2,..., wi, N ) where w i, is the weight of document associated with term i w i, is calculated similarly to tf-idf except it uses inverse term frequency (not inverse document freq.) length of vector is number of documents in collection Calculate a matrix of term correlations = k k = w w c u, v u v d u, v, CS 510 Winter 2007 18 3

Using a similarity thesaurus Compare candidate terms to the query concept or centroid Calculate virtual query vector that combines vectors for each query term, weighted by w i,q Previous uses of global analysis added terms based on correlation with individual query terms Calculate similarity of each term in collection to the query concept Add top r terms to original query Weighted by similarity to original query CS 510 Winter 2007 19 Creating a statistical thesaurus Based on clustering documents Algorithm produces hierarchical clusters Uses cosine formula from vector space model Classes derived from clusters and 3 parameters Similarity threshold (want tight clusters) Cluster size (want small clusters) Minimum inverse document frequency Only want to use low frequency terms Weight the thesaurus classes average term weight/number of terms in class CS 510 Winter 2007 20 Using a statistical thesaurus Relevance feedback on the Web Terms within a document class are considered related Documents and queries are indexed (augmented) using the thesaurus classes CS 510 Winter 2007 21 CS 510 Winter 2007 22 Query expansion on the Web Query expansion on the Web GoogleSuggest FAQs: That's pretty cool. How does it do that? Our algorithms use a wide range of information to predict the queries users are most likely to want to see. For example, Google Suggest uses data about the overall popularity of various searches to help CS rank 510 Winter the refinements 2007 it offers. 23 CS 510 Winter 2007 24 4

Relevance feedback What if we search using the medical term, varicella? Automatically expands search to include the corresponding MeSH term CS 510 Winter 2007 25 CS 510 Winter 2007 26 Textpresso Text mining application for scientific literature Can search terms or categories of terms Biological concepts (gene, cell, nucleic acid) Relationships (association, regulation) Biological processes Can search for gene or do query expansion to search for any gene name Specify a biologic concept Expand from category to instances CS 510 Winter 2007 27 CS 510 Winter 2007 28 View matches CS 510 Winter 2007 29 CS 510 Winter 2007 30 5

Document ranking Using considerations besides similarity Filters Ranking adustments Possible filters Language (English, Spanish, etc) Source Other metadata Novelty Suppress documents already seen, duplicates, near-duplicates from same site CS 510 Winter 2007 31 CS 510 Winter 2007 32 Possible ranking inputs Inputs other than similarity Importance (page rank) Popularity Reading level Currency Quality What is quality? What are the criteria? Who applies the criteria? Next: Classification and clustering CS 510 Winter 2007 33 CS 510 Winter 2007 34 6