X. A Relevance Feedback System Based on Document Transformations. S. R. Friedman, J. A. Maceyak, and S. F. Weiss

Similar documents
Automatic Document; Retrieval Systems. The conventional library classifies documents by numeric subject codes which are assigned manually (Dewey

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Different Optimal Solutions in Shared Path Graphs

Chapter 2 Basic Structure of High-Dimensional Spaces

QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING. Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose

CMPSCI 646, Information Retrieval (Fall 2003)

Robust Relevance-Based Language Models

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems

Character Recognition

A Self-Organizing Binary System*

Citation for published version (APA): He, J. (2011). Exploring topic structure: Coherence, diversity and relatedness

Neuro-fuzzy admission control in mobile communications systems

INFSCI 2140 Information Storage and Retrieval Lecture 6: Taking User into Account. Ad-hoc IR in text-oriented DS

Ccontainers and protective covers for a wide variety of products. In order. Stacking Characteristics of Composite. Cardboard Boxes

Powered Outer Probabilistic Clustering

An Active Learning Approach to Efficiently Ranking Retrieval Engines

Routing and Ad-hoc Retrieval with the. Nikolaus Walczuch, Norbert Fuhr, Michael Pollmann, Birgit Sievers. University of Dortmund, Germany.

CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval

Unit 7 Number System and Bases. 7.1 Number System. 7.2 Binary Numbers. 7.3 Adding and Subtracting Binary Numbers. 7.4 Multiplying Binary Numbers

Rational Number Operations on the Number Line

Parallel Performance Studies for a Clustering Algorithm

Conditional Volatility Estimation by. Conditional Quantile Autoregression

1. Introduction. 2. Motivation and Problem Definition. Volume 8 Issue 2, February Susmita Mohapatra

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

A Singular Example for the Averaged Mean Curvature Flow

A Survey on Postive and Unlabelled Learning

Applying the KISS Principle for the CLEF- IP 2010 Prior Art Candidate Patent Search Task

A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH

MVAPICH2 vs. OpenMPI for a Clustering Algorithm

CONSECUTIVE INTEGERS AND THE COLLATZ CONJECTURE. Marcus Elia Department of Mathematics, SUNY Geneseo, Geneseo, NY

Excerpt from "Art of Problem Solving Volume 1: the Basics" 2014 AoPS Inc.

Image Access and Data Mining: An Approach

Chapter Fourteen Bonus Lessons: Algorithms and Efficiency

Multiway Blockwise In-place Merging

Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

Using Arithmetic of Real Numbers to Explore Limits and Continuity

Discrete Applied Mathematics. A revision and extension of results on 4-regular, 4-connected, claw-free graphs

INFORMATION RETRIEVAL SYSTEM: CONCEPT AND SCOPE

A Miniature-Based Image Retrieval System

Combining PGMs and Discriminative Models for Upper Body Pose Detection

Graphical Analysis. Figure 1. Copyright c 1997 by Awi Federgruen. All rights reserved.

Math 182. Assignment #4: Least Squares

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used.

IMPROVING THE RELEVANCY OF DOCUMENT SEARCH USING THE MULTI-TERM ADJACENCY KEYWORD-ORDER MODEL

Divisibility Rules and Their Explanations

Title: Increasing the stability and robustness of simulation-based network assignment models for largescale

ResPubliQA 2010

Parallel Evaluation of Hopfield Neural Networks

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Document Clustering for Mediated Information Access The WebCluster Project

SPERNER S LEMMA, BROUWER S FIXED-POINT THEOREM, AND THE SUBDIVISION OF SQUARES INTO TRIANGLES

A Path Decomposition Approach for Computing Blocking Probabilities in Wavelength-Routing Networks

How & Why We Subnet Lab Workbook

A COINCIDENCE BETWEEN GENUS 1 CURVES AND CANONICAL CURVES OF LOW DEGREE

General properties of staircase and convex dual feasible functions

HEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A PHARMACEUTICAL MANUFACTURING LABORATORY

1. Use the Trapezium Rule with five ordinates to find an approximate value for the integral

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

SPSS INSTRUCTION CHAPTER 9

Chapter 6: Information Retrieval and Web Search. An introduction

AXIOMS OF AN IMPERATIVE LANGUAGE PARTIAL CORRECTNESS WEAK AND STRONG CONDITIONS. THE AXIOM FOR nop

Information Retrieval. (M&S Ch 15)

Flat Clustering. Slides are mostly from Hinrich Schütze. March 27, 2017

An Approach to Task Attribute Assignment for Uniprocessor Systems

Truncation Errors. Applied Numerical Methods with MATLAB for Engineers and Scientists, 2nd ed., Steven C. Chapra, McGraw Hill, 2008, Ch. 4.

Paul's Online Math Notes Calculus III (Notes) / Applications of Partial Derivatives / Lagrange Multipliers Problems][Assignment Problems]

Boolean Model. Hongning Wang

Sorting. Bubble Sort. Selection Sort

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

Tree-based methods for classification and regression

Towards Cohesion-based Metrics as Early Quality Indicators of Faulty Classes and Components

Modeling Plant Succession with Markov Matrices

Content-Based Image Retrieval of Web Surface Defects with PicSOM

Two-dimensional Totalistic Code 52

Hierarchical Minimum Spanning Trees for Lossy Image Set Compression

Precedence Graphs Revisited (Again)

PV211: Introduction to Information Retrieval

The Great Mystery of Theoretical Application to Fluid Flow in Rotating Flow Passage of Axial Flow Pump, Part I: Theoretical Analysis

Optimal Detector Locations for OD Matrix Estimation

CSE 7/5337: Information Retrieval and Web Search Document clustering I (IIR 16)

Random projection for non-gaussian mixture models

Minimum Risk Acceptance Sampling Plans: A Review

A Study of the Suitability of IrOBEX for High- Speed Exchange of Large Data Objects

Fundamentals of Operations Research. Prof. G. Srinivasan. Department of Management Studies. Indian Institute of Technology, Madras. Lecture No.

Rank Measures for Ordering

From Passages into Elements in XML Retrieval

Optimizing Shipping Lanes: 1.224J Problem Set 3 Solutions

Rate Distortion Optimization in Video Compression

High Capacity Reversible Watermarking Scheme for 2D Vector Maps

Application of Support Vector Machine Algorithm in Spam Filtering

On Biased Reservoir Sampling in the Presence of Stream Evolution

Advanced Combinatorial Optimization September 17, Lecture 3. Sketch some results regarding ear-decompositions and factor-critical graphs.

The application of Randomized HITS algorithm in the fund trading network

Bipartite Graph Partitioning and Content-based Image Clustering

IDENTIFICATION AND ELIMINATION OF INTERIOR POINTS FOR THE MINIMUM ENCLOSING BALL PROBLEM

Express Letters. A Simple and Efficient Search Algorithm for Block-Matching Motion Estimation. Jianhua Lu and Ming L. Liou

CSCI 5417 Information Retrieval Systems. Jim Martin!

THE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER

Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy

Transcription:

X-l X. A Relevance Feedback System Based on Document Transformations S. R. Friedman, J. A. Maceyak, and S. F. Weiss Abstract An information retrieval system using relevance feedback to modify the document space is proposed. A suitable algorithm is exhibited that achieves high precision and recall. 1. The Problem An information retrieval system can be viewed as consisting of a document collection together with procedures for accessing and retrieving the documents in the collection. For evaluation purposes, the main goal of such a system can be taken to be the retrieval of all documents in the collection relevant to a request presented by a user. Hence, the problem of retrieval is to relate the user's need to a collection of information. How efficiently this relation can be generated depends in general on how the document collection is organized. In traditional retrieval systems, i.e. libraries, books are grouped by subject matter according to some classification system such as the Dewey Decimal System or the Library of Congress System. Such classification systems divide human knowledge into categories such as physics, biology, etc. The virtue of all such classification systems is that books similar in content will often be located physically close to each other on the library shelves.

X-2 Libraries can be characterized as being static document spaces. That is, documents are assigned permanent positions in the document spaces. Static document spaces suffer from two major weaknesses. First, given a classification system, it is possible to construct a request such that documents relevant to that request may not be located close to each other. For example, the topic "aerodynamics of birds" might be discussed in books which are primarily concerned with aerodynamics and only slightly concerned with birds, or, alternatively in books which are primarily concerned with birds and only tangentially with their aerodynamics* Second, given a classification system, it is possible to imagine a document which is inherently difficult to classify. For example, should Churchfs Calculi of Lambda-conversion be placed with computing texts or with pure mathematics texts? The book relates to both areas. Hence, static document spaces impose a classification system on the user which may not always suit his needs. The userfs success with such a system varies as a function of how much his needs agree with the subject categories determined by the classification. Most automated retrieval systems have inherited the static document space. These systems are normally more flexible than libraries, in that subject categories are permitted to overlap. However, the fact remains that documents are assigned permanent positions in the space. Hence, as with libraries, documents relevant to a given request might lie far away from each other in the space. In the SMART system, the classification of a document is based on the importance of certain concepts in the document. A document is then represented by a vector whose elements are the relative weights of these concepts. A query is similarly represented. Hence, the query vector can

X-3 be thought of as positioned in the document space with access to all documents close to it. Since the user might not phrase his original query in the best possible manner, relevance feedback systems have been suggested to improve system operations. Most of these systems are based on successive modifications of the query vector using feedback information from the user. One such system described by Salton and Rocchio [h] has been implemented by Riddle, Horwitz, and Dietz. [2] The main weakness of the system lies in its inability to retrieve all documents relevant to a query when these documents do not lie close together in the document space. The present study addresses itself to the question of whether a relevance feedback system cam be devised that will retrieve such documents. The solution proposed is based on the use of a dynamic document space. Every user who submits a search request is given access to the standard document space. Subsequently, the document space is distorted in response to feedback information received from the user in an attempt to bring relevant documents closer to the query. Hence, the query is kept static, and the document space is altered. The standard document space is then no longer viewed as an absolute structure, but as an initial structure which can be altered to better suit the personal classification scheme of the user. An implementation of such a relevance feedback system on the CDC l6oh is described here. The system has been tested for the 82-document ADI collection using queries for which a priori relevance judgments are available. 2. The Implementation The initial task in implementing a relevance feedback system based

x-i* on document modification is to devise an algorithm that distorts the document space to bring relevant documents close to the query* The algorithm actually used is based on the assumption that concepts which are strong (highly weighted) in the documents judged relevant by the user and weak in the documents judged non-relevant tend to characterize relevant documents. Hence, the weights of these concepts should be raised in all documents in which they occur so as to increase the correlation between these documents and the query. Similarly, it is assumed that concepts which are strong in the non-relevant documents and weak in the relevant documents tend to characterize non-relevant documents and should be decreased in all documents in which they occur in order to lower the correlations. The problem remains of how the concept weights should be raised or lowered. It seems natural that the amount of alteration for a concept should depend on how much that concept characterizes the relevant or nonrelevant documents already seen by the user. That is, if a concept is judged to be a good positive discriminator (i.e. strong in relevant documents, weak in non-relevant documents), it should be raised by an amount proportional to how strong it is in the relevant documents. Similarly, concepts judged to be good negative discriminators (i.e. strong in nonrelevant documents, weak in relevant documents), should be lowered by an amount proportional to how strong they are in the non-relevant documents. It also seems desirable to insist that the amount of alteration of a concept weight for a document should be proportional to that weight. That is, if a concept which is marked as a good discriminator is very weak for a particular document, then it should not be altered much since it doesnft /

X-5 really characterize that document. On the other hand, if it is strong in the document, then it should be altered more since it provides a better characterization of the document. These considerations were used to determine the final form of the algorithm: a) Enter request. Retrieve the 10 documents having highest correlation with the query. b) Use a priori relevance judgments to determine relevant and non-relevant documents. c) Compute R., N., and D. where R. The sum over all relevant documents of the weights of concept i The number of relevant documents Ni Sum over all non-relevant documents of the weights of concept i The number of non-relevant documents D. = R. - N. i i i = The discrimination value of concept i. d) Choose those concepts which seem to be good discriminators between relevant and non-relevant documents, i.e. those for which D. > 5. To these concepts, add all concepts which occur in the query. e) For the i i for obtained in d), compute 1 2 ^ F., FT, F^ where 1 i _ Weight of concept i in query The sum of the weights of all concepts in query F2 i = Sum of the weights of concept i in relevant documents Sum of the weights of all concepts in relevant documents 3 i _ Sum of the weights of concept i in non-relevant documents Sum of the weights of all concepts in non-relevant documents

X-6 These three numbers are measures of the importance of concept i in the query, the relevant documents, and the non-relevant documents respectively. f) For the i obtained in d), compute T. where T. = ajs a F. T. -, if D < & and if the query does not have con- cept OLJH y if D > 8, or if the query has concept i. i. g) Alter the document space by changing the weights of the concepts i obtained in d ). w? < w. + w?t. if document relevant, where w? j j has not been judged non- is the weight of concept i for document and where the left arrow denotes assignment. If document j has been judged non-relevant, all its concept weights are set to zero. This step has the effect of moving documents judged non-relevant as far from the query as possible. h) Return to a) using the same query and the altered document space. Actually g) is performed during the search in step a). This loop is continued until the user is satisfied with his results or until he gives up. The algorithm makes use of three parameters: 8, QL, a 8 is the cut-off point for discrimination values. High values for 8 lower the number of good discriminators, a and OL are the respective weights given to the query and to the relevant and non-relevant documents to denote their importance in adjusting the document space. The algorithm was originally intended to use the cosine correlation function for correlating the query to documents. The cosine correlation

X-7 function is given by m E (qidi) i=l where m is the number of indexing concepts, q. and d are the weights of the i concept in the query and documents vectors respectively. However, early experiments with the algorithm gave better results with a modified version of the cosine function. In this version, the denominator is kept constant (i.e. equal to its value in the initial search) through all updates of the document space. This modification has the effect of magnifying all changes to the document vectors. It was decided to experiment with each of these correlation functions in the algorithm. The algorithm was implemented by modifying existing relevance feedback programs which used query altering. Queries from the ADI collection were processed using various choices for a, oc, 6 and correlation function. After processing 11 queries, the program obtains precision-recall curves averaged over all 11 queries. These curves are given in Figs. 1 through 6. 3. Experimental Results In general, modification of the document space using relevance feedback information leads to significantly better results than modification

x-8 of the query. The highest averages of normalized precision and normalized recall achieved by Riddle, Horwitz, and Dietz using query modification are 755^ and.8386 respectively. Corresponding averages using document mod- ification are.8576 and.8871 for the modified cosine correlation function and.7987 and.8806 for the standard cosine correlation function. This improvement can also be seen by comparing Figs. 1 and 7. Pig. 1 gives recall-precision curves averaged over 11 requests using document space modification. Fig. 7 gives typical corresponding curves for query modification. Tables 1 and 2 show the reason for this improvement. Table 1 lists the relevant documents retrieved by query modification and by document modification. For queries 11, 12, lk, and 15, document modification techniques retrieved more relevant documents. Table 2 gives the documents retrieved by both techniques for these queries and the number of concepts each such document has in common with the relevant documents retrieved by only document modification techniques. In most cases, this number is 1, indicating very little similarity between the documents retrieved by both techniques and those retrieved by only document modification. Hence, the improved results achieved by document modification are due to its ability to retrieve documents unlike each other but relevant to the query. Figs. 1 through 6 show the effect of different values of & on recall-precision curves averaged over 11 requests. The curves for the third iteration for 8 equal to 3 and 5 are almost identical and are better than the third iteration curves for & equal to U. Also, the curves for the second iteration for 6 equal to 3 are better than the curves for the second iteration for 6 equal to 5* Hence, smaller values of the relevant documents to converge faster to the query. 8 cause

x-9 Query- No. DOCUMENT QUERY MODIFICATION Retrieved Not Ret. NUMBERS DOC. MODIFICATION Retrieved Not Ret. 9 10 50,82 29,39,2 50,11 50,82 50,2,39 11,29 11 12 43 k 79,81,8 42,9,25,7 4 3,79 4,25,7,^2,9 81,8 13 Ik 1,1^,37,27 20 80,65 33 1,37,65 20,33 14,27,80 15 26 11,37 5 36,6,32,67 11,37,6 5 32,36,67 28 69,9,48,70 69,9>8,70 Comparison of relevant documents retrieved by query modification and document modification. Table 1 Query No. Retrieved by Query and Doc. Mod. Retrieved only by Doc. Mod. No. of Concepts in common 11 12 43 4 4 4 4 79 25 7 9 42 1 1 1 3 2 14 15 20 11 33 6 1 1 37 6 3 Relationship between documents retrieved by both techniques and those retrieved only by document modification. Table 2

X-10 Initial Search.0 - o o a _ A A * First Iteration *CL o o o Second Iteration 91 XQ D a D O Third Iteration 2.6h > \ (A g.3k *^# fc. ^A *.4r- N^ \ a o D ah - _ X.1 *«> J I I I I I L - -.5.6.7.8.9 1.0 Recoil a i * 2 " 1 ' 5 = 3> modified cosine Document Space Modification ADI Collection - Averages over 11 Requests Fig. 1

1.0 r o o^.9.8 V X ^ X.7 c.2.6 CO 5 s -.4.3.2 1 1 0 -~X V Xo v A \ «\ X 1 1 fr.1.2.3.4.5.6.7.8.9 1.0 Recall a = a = l, 5=3, modified cosine Document Space Modification Fig. 2

4 Oh D D 9k ^o^g H 8h A>^ b o 5 - A VN A 3 2 _l I I I I I I I I I».1.2.3.4.5.6.7.8.9 1.0 Recall a = a = l, 6 = li, modified cosine Document Space Modification Fig. 3

X-13 1.0.9^.8 X *N \ 7h \ O Q I I <>vx.2.1 J I I I I I L.2.3.4.5.6.7.8.9 1.0 Recall a. L = O. - 1, 5 = 3, standard cosine Document Space Modification Fig. 1*

A.Op 9 r 8 r 7 r 6 r 5 r 4 r 3 r 2 r.1.2.3 i i I I I 1 L.4.5.6.7.8.9 1.0 Recall a = a = 1, 5=5* standard cosine Document Space Modification Fig. 5

X-15 * o 1.0.9.8.7 o>.5 a..4.3.2.1 D -D D. sh - \ v\. ' A \, O J L J L J L_^.1.2.3.4.5.6.7.8.9 Recall a = a = 1, 8 = 11, standard cosine Document Space Modification Fig. 6

i.ok o_o o w Q 0 ^ O.2 \ ' ' ' ' ' ' ' I I L.1.2.3.4.5.6.7.8.91.0 Recall Typical results using query modification ADI Collection - 11 requests Fig. 7

X-17 Figs, k to 6 give average recall-precision curves achieved by using the standard cosine correlation function. Comparison with Figs. 1 to 3 shows that the modified cosine correlation function yields much better results. The modified cosine correlation function causes the relevant documents to converge faster to the query. In fact, as can be seen from the graphs, modified cosine achieves, on the average, better recall and precision after 2 iterations than standard cosine achieves after 3» Experiments performed on the effect of varying quite disappointing. oc and oc are The results are extremely erratic and do not seem to yield any general conclusions about the effects of oc particular, the use of different values of a and oc and oc. In for each iteration causes undesirable instabilities such as relevant documents being very nearly retrieved on the second iteration and then being moved far from the query on the third iteration. The results for oc and oc equal to 1 are the best ones obtained in this investigation. k* Conclusions The effectiveness of relevance feedback is enhanced by the use of a dynamic document space. Such a space enables relevant documents to be brought close to each other and close to the query. Moreover, this study exhibits a relevance feedback algorithm based on document modification that consistently achieves better recall and precision than the algorithm based on query alteration suggested by Salton and Rocchio, and implemented by Riddle, Horwitz, and Dietz.

X-18 These conclusions suggest further areas for investigation. First, there is a definite need for a theoretical study of document modification. The present study presents a particular algorithm that works well. However, this algorithm is not necessarily optimal. Theoretical investigation might uncover such an algorithm. Second, a study should be made of the efficiency of document modification. It is clearly a more time consuming method than query modification, since changes must be made to the entire document collection as opposed to being made only on the query. However, further investigation might show that document modification can be implemented without too great a loss in efficiency over query modification. In this case, the improved results achieved by document modification would justify the extra expenditure of computing time. Third, an investigation should be made into the possibility of using both query and document modification. The results achieved in this study might be further improved by such a hybrid technique.

x-19 References [l] E. M. Keen, "Evaluation of Relevance Feedback Methods", Report ISR-11 to the National Science Foundation, Cornell University, June 1966. [2] T. Horwitz, R. Dietz, 0. W. Riddle, "Relevance Feedback in an Information Retrieval System", Report ISR-11 to the National Science Foundation, June 1966. [3] J. J. Rocchio, "Harvard Doctoral Thesis", Report ISR-10 to the National Science Foundation, Harvard Computation Laboratory, June 1965. [h] G. Salton, J«J. Rocchio, "Information Search Optimization and Iterative Retrieval Techniques", AFIPS Conference Proceedings, Vol. 27, Part 1, FJCC, 1965* [5] G. Salton, Automatic Information Retrieval, Class Notes, Computer Science ^35, Cornell University.