TREC-3 Ad Hoc Retrieval and Routing Experiments using the WIN System

Paul Thompson, Howard Turtle, Bokyung Yang, James Flood
West Publishing Company, Eagan, MN 55123

1 Introduction

The WIN retrieval engine is West's implementation of the inference network retrieval model [Tur90]. The inference net model ranks documents based on the combination of different evidence, e.g., text representations such as words, phrases, or paragraphs, in a consistent probabilistic framework [TC91]. WIN is based on the same retrieval model as the INQUERY system that has been used in previous TREC competitions [BCC93, Cro93, CCB94]. The two retrieval engines have common roots but have evolved separately; WIN has focused on the retrieval of legal materials from large (>50 gigabyte) collections in a commercial online environment that supports both Boolean and natural language retrieval [Tur94].

For TREC-3 we decided to run an essentially unmodified version of WIN to see how well a state-of-the-art commercial system compares to state-of-the-art research systems. Some modifications to WIN were required to handle the TREC topics, which bear little resemblance to queries entered by online searchers. In general we used the same query formulation techniques used in the production WIN system, with a preprocessor to select text from the topic in order to formulate a query.

WIN was also used for routing experiments. Production versions of WIN do not provide routing or relevance feedback, so we were less constrained by existing practice. However, we decided to limit ourselves to routing techniques that generated normal WIN queries. These routing queries could then be run using the standard search engine.

In what follows, we describe the configuration used for the experiments (Section 2) and the experiments that were conducted (Sections 3 and 4).

2 System Description

The TREC-3 text collection was indexed in essentially the same way for both the ad hoc and routing experiments. Some fields within each document were not indexed; these fields include: CO, DESCRIPT, DOC, DOCID, DOCNO, FILEID, FIRST, G, GV, IN, MS, NS, RE, SECOND. These fields were excluded either because they contained manually indexed terms (which cannot be used under the TREC rules) or because they were considered to be noise.

A bounded paragraph algorithm [Cal94] was used to identify paragraph boundaries. Natural paragraphs were used subject to the constraint that a paragraph had to contain a minimum of 50 and a maximum of 200 words.

All of the text not contained in the excluded fields was indexed, except for Federal Register documents. Federal Register documents tend to be very long and to contain a great deal of noise. In an attempt to identify text that was a reasonable description of document content, we indexed only the "SUMMARY" paragraph if the document contained one; otherwise we indexed only the first kilobyte of text in a Federal Register document. Since no Federal Register documents were contained in the routing test collection, all text except for the excluded fields was indexed for routing.
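The bounded-paragraph constraint above (natural paragraphs of at least 50 and at most 200 words) could be implemented along the following lines. This is only an illustrative reading of the constraint, not the algorithm of [Cal94]; the blank-line paragraph split and the merge/split policy for short and overlong paragraphs are assumptions.

    from typing import List

    MIN_WORDS, MAX_WORDS = 50, 200

    def bounded_paragraphs(text: str) -> List[str]:
        """Split text into indexing units of roughly 50-200 words.

        Natural paragraphs (blank-line separated) are used where possible;
        short paragraphs are merged with what follows and overlong ones are
        split on word boundaries.  Illustrative policy only.
        """
        units: List[str] = []
        buffer: List[str] = []
        for para in (p.strip() for p in text.split("\n\n") if p.strip()):
            buffer.extend(para.split())
            while len(buffer) >= MAX_WORDS:          # overlong: emit a full-size unit
                units.append(" ".join(buffer[:MAX_WORDS]))
                buffer = buffer[MAX_WORDS:]
            if len(buffer) >= MIN_WORDS:             # long enough: close the unit
                units.append(" ".join(buffer))
                buffer = []
        if buffer:                                   # trailing short fragment
            if units:
                units[-1] = units[-1] + " " + " ".join(buffer)
            else:
                units.append(" ".join(buffer))
        return units

Merging short natural paragraphs forward keeps every emitted unit inside the 50-200 word window, except for a trailing fragment, which is folded into the previous unit.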

3 Ad hoc experiments

The ad hoc experiments used queries that were automatically created from the topic text. The retrieval algorithm combined document and top-paragraph scoring. It was observed that the a priori likelihood of relevance for a document varied from collection to collection. Furthermore, each collection's likelihood of relevance, given the value of the domain field, varied as well. Some experiments were done in an attempt to exploit these observations.

3.1 Query Processing

A WIN query consists of concepts extracted from natural language text. Rather than extracting concepts from the full topic, only the Title field, the Description field, and the first sentence of the Narrative field were used. Each occurrence of a term, or concept, was counted and weighted by field. A term appearing in the Title was given a weight of 4, while terms appearing in the Description and Narrative were given weights of 2 and 1, respectively.

Normal WIN query processing eliminates introductory clauses and recognizes phrases and other important concepts for special handling. Many of the concepts ordinarily recognized by WIN are specific to the legal domain (e.g., legal citations, West Key Numbers) and were not used in these experiments. WIN ordinarily makes use of a dictionary of introductory clauses (e.g., "Find cases about ...", "I'm interested in statutes that ...") that don't bear directly on the content of the query. The set of introductory clauses was expanded to include 170 new clauses (e.g., "A relevant document must describe ...") identified in the Description and Narrative fields in the training set. In addition, the string "e.g" was added to the set of query stopwords.

WIN also expands some query terms automatically. For example, "usa", "us", "u.s", and "united states" were all replaced with the synonym class #syn(ac:us #+1(united states)) that will conflate common variants. Twenty-nine new synonym classes were added for automatic expansion.

WIN ordinarily uses a legal dictionary to find phrases in queries. For TREC-3 the dictionary was expanded with phrases extracted from the machine-readable Collins Dictionary. The normal WIN dictionary incorporates information about how a phrase identified in a query is to be matched in document text. For example, query stopwords are generally not considered to be significant, but for some phrases (e.g., "at will") they are used. None of the phrases extracted from the Collins Dictionary used any special recognition features.
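As a concrete reading of the query-formulation steps above (field weights of 4, 2, and 1, removal of introductory clauses, and query stopwords), a minimal sketch follows. The tokenizer, the stopword list, and the single introductory-clause pattern are placeholders rather than WIN's own dictionaries, and phrase recognition and synonym classes are omitted.

    import re
    from collections import Counter
    from typing import Dict

    FIELD_WEIGHTS = {"title": 4, "desc": 2, "narr": 1}
    STOPWORDS = {"a", "an", "and", "e.g", "in", "of", "or", "the", "to"}
    INTRO_CLAUSE = re.compile(r"^a relevant document (must|will) (describe|discuss)\b", re.I)

    def topic_to_query(title: str, desc: str, narr: str) -> Dict[str, int]:
        """Build a weighted bag of query concepts from selected topic fields."""
        # Only the Title, the Description, and the first Narrative sentence are used.
        fields = {"title": title, "desc": desc, "narr": narr.split(".")[0]}
        weights: Counter = Counter()
        for name, text in fields.items():
            text = INTRO_CLAUSE.sub("", text.strip())        # drop an introductory clause
            for tok in re.findall(r"[a-z0-9.']+", text.lower()):
                tok = tok.strip(".'")
                if tok and tok not in STOPWORDS:
                    weights[tok] += FIELD_WEIGHTS[name]      # occurrence count, weighted by field
        return dict(weights)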

3.2 Experiments with different likelihoods of relevance based on collection

In the TREC training set, the likelihood that a document will be judged relevant depends heavily on the collection in which it is found. Table 1 shows the distribution of documents among the five TREC collections and the distribution of relevant documents among the five collections. The AP collection, for example, contains 22.2% of all documents in the TREC collection, but it contains 31.4% of all relevant documents in the TREC collection.

                                 AP    DOE     FR    WSJ   Ziff

    % of all documents          22.2   30.5   6.2   23.3   17.8

    Topics 1-50
      % of relevant documents   20.1    0.8   3.1   33.9   42.2
      relevant/all ratio        0.91   0.03   0.50   1.45   2.37

    Topics 51-100
      % of relevant documents   37.2    7.5   3.1   38.0   14.2
      relevant/all ratio        1.68   0.25   0.50   1.63   0.80

    Topics 101-150
      % of relevant documents   41.3    5.8   3.5   39.2   10.2
      relevant/all ratio        1.86   0.19   0.56   1.68   0.57

    Total
      % of relevant documents   31.4    4.4   3.2   36.7   24.3
      relevant/all ratio        1.41   0.14   0.52   1.58   1.37

    Table 1: Collection bias in relevance judgments. The ratio rows divide each
    collection's share of relevant documents by its share of all documents.

Table 1 shows that, for all topics, documents in two of the collections (DOE and Federal Register) are substantially less likely to be judged relevant than would be expected if there were no collection bias, whereas documents from the Wall Street Journal, AP, and Ziff collections are much more likely to be judged relevant than expected. Table 1 also shows that the distribution of relevant documents among the collections varies for different topic sets. For example, Ziff documents are much more likely to be judged relevant than expected for Topics 1-50, but less likely than expected for the remaining two topic sets.

A set of experiments was conducted in which the prior probability of relevance was set to the observed probability of relevance for each of the TREC collections rather than a default probability that was the same for all documents. This essentially biased retrieval in favor of AP, Wall Street Journal, and Ziff documents and against DOE and Federal Register documents. These experiments showed a slight drop in retrieval effectiveness because the priors computed for the entire topic set rarely match the priors computed for individual topics.

A second set of experiments was conducted to determine whether it would be possible to predict the appropriate collection biases based on the characteristics of individual topics. Approaches were tried using both the language contained in the topics and the domain field contained in many of the training topics (note, however, that the test topics do not contain domain fields). None of these approaches significantly improved performance, but the amount of effort devoted to these experiments was limited. We regard this as a promising line of future research.
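The bias visible in Table 1 is simply each collection's share of relevant documents divided by its share of all documents, and a per-collection prior could in principle be applied as a score adjustment. How WIN folds a prior probability of relevance into the inference-network belief is not spelled out above, so the multiplicative form in this sketch is an assumption, not the system's actual mechanism.

    # Shares taken from the "Total" rows of Table 1, expressed as fractions.
    share_of_docs     = {"AP": 0.222, "DOE": 0.305, "FR": 0.062, "WSJ": 0.233, "ZIFF": 0.178}
    share_of_relevant = {"AP": 0.314, "DOE": 0.044, "FR": 0.032, "WSJ": 0.367, "ZIFF": 0.243}

    # Bias factor: how much more (or less) likely a document from this collection
    # is to be relevant than a uniform prior would suggest.
    bias = {c: share_of_relevant[c] / share_of_docs[c] for c in share_of_docs}
    # e.g. bias["AP"] ~= 1.41 and bias["DOE"] ~= 0.14, matching the ratios in Table 1.

    def biased_score(base_score: float, collection: str) -> float:
        """Adjust a document score by its collection's bias factor (assumed form)."""
        return base_score * bias[collection]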

4 Routing Experiments

The routing experiments used the same techniques as the ad hoc experiments to index the text collection, except that idf values were derived differently. Since the test collection was to be used as a simulation of routing, the TREC guidelines do not allow use of any collection-wide statistics, such as idf. Accordingly, the idf values from the CD-1 training set were used instead. Query processing, or profile creation, however, was done in a substantially different manner. No attempt was made to use the observed likelihoods of relevance of different collections, as was done with the ad hoc queries. The routing experiments were based on query expansion. No term reweighting was done.

4.1 Profile Processing

As was the case with the ad hoc queries, only certain portions of the topic text were used for profile creation. These were the Title and Concepts fields. As before, each occurrence of a term, or concept, was counted and also weighted by field. A term appearing in the Title field was given a weight of 2, while a term appearing in the Concepts field received a weight of 1. Any term not appearing in any of the relevant training documents was removed.

Consideration was given to increasing the weight of any term appearing in relevant, but not in irrelevant, documents. This had no effect; only one term met this condition.

As a form of normalization, a maximum weight that a term could attain was set. This maximum was variously set at 5, 6, 7, and 8. It included the contribution provided by the term expansion process, which was always 1 for a selected term and 0 for a non-selected term (see below). A term might appear multiple times in the Concepts field, thus resulting in an unnormalized term weight that exceeded the maximum.

None of the usual WIN query formulation aids used with ad hoc queries (elimination of introductory clauses, use of replacement strings, and use of a phrase dictionary) were used for profiles. The Title and Concepts fields did not contain any introductory phrases or clauses. Simple acronyms, such as "RISC" or "MIPS", that were found in the text of relevant training documents during query expansion were identified as acronyms in the profiles, so that they would be treated as instances of the same concept in subsequent processing.

4.2 Query expansion

The focus of the routing experiments was on query expansion. Three different approaches to query expansion were used: "best entire document", "best rntidf top 200 paragraphs", and "best rntidf top paragraph". Ultimately these three approaches were combined in the "best overall" approach. The approaches were themselves based on three methods of term selection: "rddf", "rntidf", and "rtdf" [HC93]. The rddf score of a term was calculated by multiplying its idf value by the number of relevant training documents in which the term occurred. The rntidf score for a term was calculated by multiplying its idf value by the summation, over all relevant training documents, of the ratio of the term's frequency to the frequency of the maximally occurring term in that particular document. The rtdf score was simply the multiplication of the term's idf value by the number of occurrences of the term within the relevant training documents.
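Read directly from the definitions above, the three selection scores might be computed as in the sketch below. The idf formula, the data layout (tokenized relevant training documents plus a document-frequency table), and the function names are assumptions made for illustration.

    import math
    from collections import Counter
    from typing import Dict, List

    def idf(term: str, df: Dict[str, int], n_docs: int) -> float:
        """A generic idf; the exact idf formula used by WIN is not given above."""
        return math.log((n_docs + 1) / (df.get(term, 0) + 1))

    def term_selection_scores(relevant_docs: List[List[str]],
                              df: Dict[str, int],
                              n_docs: int) -> Dict[str, Dict[str, float]]:
        """Compute rddf, rntidf, and rtdf for every term in the relevant documents."""
        doc_tfs = [Counter(doc) for doc in relevant_docs if doc]
        vocab = {t for tf in doc_tfs for t in tf}
        scores: Dict[str, Dict[str, float]] = {}
        for t in vocab:
            w = idf(t, df, n_docs)
            rd = sum(1 for tf in doc_tfs if t in tf)                          # relevant docs containing t
            rnt = sum(tf[t] / max(tf.values()) for tf in doc_tfs if t in tf)  # normalized tf sum
            rt = sum(tf[t] for tf in doc_tfs)                                 # raw tf sum
            scores[t] = {"rddf": w * rd, "rntidf": w * rnt, "rtdf": w * rt}
        return scores

Terms would then be ranked by the chosen score and the top 5 to 50 added to the profile, as described in the paragraphs that follow.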

The rtdf score did not perform as well as the other two term selection methods, and so was not used as part of the final runs.

For each term selection method, the terms selected were those with the highest scores. Terms were only selected from relevant documents. The term scores were only used for term selection, not for term reweighting. A selected term was given a weight of 1 in the expanded query. If the term duplicated a term already represented in the topic profile, then 1 was added to that term's current score.

A baseline run was made using the profile creation process described above, but with no term expansion. Each of the three expansion approaches was then run with terms added by one or both of the remaining term selection methods, i.e., excluding rtdf. For each approach, runs were made with from 5 to 50 terms added, in increments of 5.

The "best entire document" approach used both the rddf and the rntidf methods of term selection, with terms selected from any part of the document. The term selection method that performed better on the training set was selected on a topic-by-topic basis. With the "best rntidf top 200 paragraphs" and the "best rntidf top paragraph" approaches, as their names imply, only the rntidf method was used, as it provided better results.

For the "best rntidf top 200 paragraphs" approach, searches were done for each topic using the baseline, i.e., unexpanded, profile as a query against the training collection. For each topic, the top 200 scoring paragraphs from relevant documents were identified, using the WIN paragraph scoring method. Terms were then selected from these paragraphs using the rntidf method, rather than from the entire text of the relevant documents. For the "best rntidf top paragraph" approach a similar procedure was followed, except that instead of using the top 200 paragraphs from any relevant documents, the top scoring paragraph of each relevant document was used as a source for rntidf term selection. For each of the three approaches, the maximum weight allowed for a term, i.e., 5, 6, 7, or 8 (see Section 4.1), was, on a topic-by-topic basis, the weight that gave the best performance on the training set.

Finally, the method of query expansion used for the officially submitted run, "best overall", was a combination of the three approaches described above. This method selected the best query expansion provided by any of the three approaches on a topic-by-topic basis. Rather than simply selecting the best approach per topic in this manner, some consideration was given to trying to combine the results of the different methods [FS94, BKCQ94], but no experiments were carried out.
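The selection-plus-capping procedure described above amounts to something like the following sketch; the function signature and the example values are illustrative only, not WIN code.

    from typing import Dict, List

    def expand_profile(profile: Dict[str, int], ranked_terms: List[str],
                       n_add: int, max_weight: int) -> Dict[str, int]:
        """Add the top-ranked expansion terms to a topic profile.

        A selected term enters with weight 1; a term already in the profile has
        1 added to its weight; every weight is then capped at max_weight (the
        per-topic maximum of 5-8 from Section 4.1, expansion contribution
        included).
        """
        expanded = dict(profile)
        for term in ranked_terms[:n_add]:
            expanded[term] = expanded.get(term, 0) + 1
        return {t: min(w, max_weight) for t, w in expanded.items()}

    # Hypothetical usage: add 10 rntidf-ranked terms, capping weights at 6.
    # expand_profile({"risc": 2, "processor": 1}, ranked_terms, n_add=10, max_weight=6)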

5 Summary

WIN was able to achieve strong performance on both the ad hoc retrieval and routing tasks without any major modifications being made to its retrieval engine. The ad hoc results show the effectiveness of its basic indexing and retrieval operations. Some techniques that were expected to give improved performance did not lead to much improvement. In some cases this may be because only limited investigations could be done, e.g., when using the collection-dependent likelihood of relevance. In other cases, such as the failure of phrases to yield much improvement, the result may indicate the difficulty of making effective use, on a collection the size of the TREC collection, of a feature that has given good results on smaller collections [Har93].

References

[BCC93] John Broglio, James P. Callan, and W. Bruce Croft. INQUERY system overview. In Proceedings of the TIPSTER Text Program (Phase 1) Workshop, pages 47-67, Morgan Kaufmann, September 1993. ISBN 1-55860-337-9.

[BKCQ94] N. J. Belkin, P. Kantor, C. Cool, and R. Quatrain. Combining evidence for information retrieval. In Donna K. Harman, editor, The Second Text Retrieval Conference (TREC-2), pages 35-44, National Institute of Standards and Technology, March 1994. Proceedings available as NIST Special Publication 500-215.

[Cal94] James P. Callan. Passage-level evidence in document retrieval. In W. Bruce Croft and C. J. van Rijsbergen, editors, Proceedings of the Seventeenth Annual International Conference on Research and Development in Information Retrieval, pages 212-221, Springer-Verlag, London, July 1994.

[CCB94] W. Bruce Croft, Jamie Callan, and John Broglio. TREC-2 routing and ad-hoc retrieval evaluation using the INQUERY system. In Donna K. Harman, editor, The Second Text Retrieval Conference (TREC-2), pages 75-84, National Institute of Standards and Technology, March 1994. Proceedings available as NIST Special Publication 500-215.

[Cro93] W. Bruce Croft. The University of Massachusetts TIPSTER project. In Donna K. Harman, editor, The First Text Retrieval Conference (TREC-1), pages 101-105, National Institute of Standards and Technology, March 1993. Proceedings available as NIST Special Publication 500-207.

[FS94] Edward A. Fox and Joseph A. Shaw. Combination of multiple searches. In Donna K. Harman, editor, The Second Text Retrieval Conference (TREC-2), pages 243-252, National Institute of Standards and Technology, March 1994. Proceedings available as NIST Special Publication 500-215.

[Har93] Donna Harman. Document detection summary of results. In Proceedings of the TIPSTER Text Program (Phase 1) Workshop, pages 33-46, Morgan Kaufmann, September 1993. ISBN 1-55860-337-9.

[HC93] David Haines and W. Bruce Croft. Relevance feedback and inference networks. In Robert Korfhage, Edie Rasmussen, and Peter Willett, editors, Proceedings of the Sixteenth International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2-11, June 1993.

[TC91] Howard Turtle and W. Bruce Croft. Evaluation of an inference network-based retrieval model. ACM Transactions on Information Systems, 9(3):187-222, July 1991.

[Tur90] Howard Turtle. Inference Networks for Document Retrieval. PhD thesis, Computer Science Department, University of Massachusetts, Amherst, MA 01003, 1990. Available as COINS Technical Report 90-92.

[Tur94] Howard Turtle. Natural language vs. Boolean query evaluation: a comparison of retrieval performance. In W. Bruce Croft and C. J. van Rijsbergen, editors, Proceedings of the Seventeenth Annual International Conference on Research and Development in Information Retrieval, pages 212-221, Springer-Verlag, London, July 1994.