Routing and Ad-hoc Retrieval with the TREC-3 Collection in a Distributed Loosely Federated Environment

Nikolaus Walczuch, Norbert Fuhr, Michael Pollmann, Birgit Sievers
University of Dortmund, Germany

Abstract

In this paper, we investigate retrieval methods for loosely coupled IR systems. In such an environment, each IR system operates independently on its own document collection. For query processing, an agent takes the query and sends it to the different IR systems. From the answers received from these servers, it forms a single ranking and sends it back to the user. In the work presented here, we examine different retrieval methods for performing routing and ad-hoc queries in such an environment. For experiments, we use the TREC-3 collection and the SMART retrieval system.

1 Introduction

In the past, IR systems have been considered as running on a large, fairly static document collection held on a single machine. However, there is a growing need for the development of IR systems that operate in a distributed environment: document collections for similar subject fields are set up at different locations with different systems and retrieval methods. In retrieval, a user would like to query all relevant databases at once, without bothering about the underlying differences. With the improvements of local computing power and network access, the coupling of local and non-local IR databases becomes possible.

(Footnote: On leave from: Universidad de los Andes, Merida, Venezuela)

So far, little research has focused on retrieval in a distributed environment. More precisely, we must distinguish between different types of distribution. We use the term "distributed IR systems" in the same way as in the field of databases: on the logical level, such a system behaves in the same way as a non-distributed system; the differences are only at the physical level (e.g. distributed storage, replication and partitioning of data, special query processing methods and communication protocols). In contrast, loosely coupled systems also behave differently on the logical level. This approach offers the advantage that existing systems can be coupled without changing their software. A simple example of loosely coupled IR systems is the design of the WAIS system (see [Kahle et al. 93]). The underlying protocol Z39.50 allows a client to search several IR systems in parallel. In this paper, we only consider the latter type of system, although some of the methods investigated may also be feasible for distributed IR systems.

We call the different IR systems "servers". A user submits a query to a client, which in turn connects to an "agent". The agent's task is to communicate with the different servers. It sends the query to the servers, receives the different answers, forms a single ranking and sends it back to the user. Some of the servers contacted may in fact also be agents, which in turn know further servers or agents that help in processing the query. A structure of this type is depicted in figure 1. One example is the SFgate system described in [Pfeifer et al. 95], which implements a WAIS gateway for WWW clients and acts as an agent in that it can access several servers in parallel. In the original WAIS system, there are only clients and servers, and the client must also perform the agent's task.
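The agent's role described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system used in the paper: each server is modelled as a hypothetical callable returning (RSV, document id) pairs, and the agent simply merges the answers by decreasing RSV, which presumes that the RSVs of different servers are comparable.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Hit = Tuple[float, str]  # (RSV, document id); hypothetical answer format

def agent_search(query: str,
                 servers: Dict[str, Callable[[str, int], List[Hit]]],
                 k: int = 10) -> List[Hit]:
    """Send the query to every server and merge the answers into one ranking."""
    answers = []
    for _name, search in servers.items():
        # Each server returns at most k hits; sort them with the best RSV first.
        answers.append(sorted(search(query, k), reverse=True))
    # heapq.merge with reverse=True expects inputs sorted in descending order.
    merged = heapq.merge(*answers, reverse=True)
    return list(merged)[:k]

# Two stand-in servers with fixed answers (assumed data, for illustration only).
demo_servers = {
    "A": lambda q, k: [(0.9, "a1"), (0.4, "a2")],
    "B": lambda q, k: [(0.7, "b1"), (0.5, "b2")],
}
print(agent_search("distributed retrieval", demo_servers, k=3))
```

A nested agent fits the same callable signature, so an agent could itself be registered as a server of another agent.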
For performing retrieval in such an environment, a number of problems have to be solved:

- How do agents and servers communicate?
- What information should the agent have about the different servers that it can access?
- How can the agent select servers that provide relevant documents for the current query?
- Which retrieval methods should be provided by the servers in order to give optimum support for retrieval in loosely coupled IR databases?
- How should the information agent merge incoming ranking results from different sources (with possibly unknown quality)?
- What retrieval quality can be achieved in such an environment, or what is the loss in retrieval quality in contrast to retrieval on a single large database?

[Figure 1: Retrieval with loosely coupled IR systems (user, client, agents)]

As the TREC collection originally consists of five different document collections, it is quite natural to use this collection for experiments with a distributed environment. In contrast to the very general case, we only consider a single agent contacting five servers, each running on one of the subcollections. As the experimental system, the SMART retrieval system was used for simulating retrieval methods in this environment. We did not actually implement a distributed environment, since our focus here is on the investigation of retrieval methods for such an environment. Once good retrieval methods have been devised, an architecture for federated IR systems supporting these methods can be developed.

In the remainder of this paper, we first describe the basic concepts of the retrieval methods investigated, followed by the description of the experiments performed. The results are discussed in Section 4, and finally, we come to the conclusions and give an outlook on further work.

2 Basic concepts

2.1 Distributed routing retrieval

For the routing task, we have queries and a set of documents with relevance judgements. As new documents arrive, the system has to assign relevance status values (RSVs) to these documents, by considering these documents one by one. Since there is relevance feedback data available, learning methods can be applied in order to improve the representation of queries. As a result, the weights of the query terms are modified. Furthermore, we can apply query expansion methods in order to add terms to the original query.

In a distributed environment, the agent collects relevance feedback data and performs the learning process, possibly in cooperation with the servers. Having improved the query representation this way, it sends it to the servers. Each server processes this query against its incoming documents, and sends the requested number of documents (or all documents with an RSV exceeding a predefined threshold) back to the agent. The agent combines the streams from the different servers into one single output stream and sends it to the client.

In our experiments, we assume that we can compute overall document frequencies, thus being able to compute query term weights in the same way as in a single collection environment. This can be achieved by additional communication overhead, where the agent collects the collection frequencies of the query terms from all servers in the learning phase. On the other hand, document indexing weights are independent of the overall document frequencies of terms.

2.2 A Concept for distributed ad-hoc retrieval

For ad-hoc queries, only the query formulation, but no relevance feedback data is given for the current query.
After receiving a new query from a client, the agent forwards the query to the servers (possibly after preselecting the servers relevant for this query). Each server processes this query on its database and sends the prespecified number of answer documents to the agent. Now the agent merges these results into one single ranked output list. For performing this task, there are at least three different possibilities:

1. The agent takes the RSVs computed by the servers as absolute values and forms the output by merging the input lists according to decreasing RSVs.
2. The agent collects all the documents received for a query and creates a new ranking for this set of documents, e.g. by treating this set as a small document collection.
3. If the agent receives additional collection information from the servers, a better ranking can be produced: assume that each server also sends its document frequencies for all query terms. Then the agent can compute the overall document frequencies and use them as improved query term weights for ranking the set of documents received.

In our experiments, we investigated the last two possibilities.

3 Experiments

For the description of the document and query indexing methods given below, we use the following notations:

$q_k$: query
$d_m$: document
$t_i$: term
$tf_{im}$: within-document frequency (wdf) of $t_i$ in $d_m$
$maxtf_m$: maximum wdf $tf_{im}$ of all terms in $d_m$
$idf_i$: inverse document frequency of $t_i$

Based on these parameters, standard SMART indexing routines were applied ([Salton & Buckley 88]). The augmented term frequency $atf_{im}$ is computed as

\[ atf_{im} = 0.5 \left( 1 + \frac{tf_{im}}{maxtf_m} \right), \]

and the logarithmic term frequency $ltf_{im}$ is defined as

\[ ltf_{im} = 1 + \ln tf_{im}. \]

Furthermore, we always use cosine normalization for document indexing, i.e. the weights are normalized such that the sum of squares of the indexing weights in a document is 1.
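The two term frequency components and the cosine normalization can be written down directly. The following Python sketch is an illustration only and not SMART code; the function names and the token-list input are assumptions made for this example. It covers the two collection-independent document indexing schemes, "anc" and "lnc".

```python
import math
from collections import Counter
from typing import Dict, List

def atf(tf: int, max_tf: int) -> float:
    """Augmented term frequency: 0.5 * (1 + tf / max_tf)."""
    return 0.5 * (1.0 + tf / max_tf)

def ltf(tf: int) -> float:
    """Logarithmic term frequency: 1 + ln(tf), defined for tf >= 1."""
    return 1.0 + math.log(tf)

def cosine_normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale the weights so that their squares sum to 1."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return dict(weights)
    return {t: w / norm for t, w in weights.items()}

def index_document(tokens: List[str], scheme: str = "lnc") -> Dict[str, float]:
    """Index one document with 'anc' or 'lnc' (no collection statistics needed)."""
    counts = Counter(tokens)
    if not counts:
        return {}
    max_tf = max(counts.values())
    if scheme == "anc":
        raw = {t: atf(tf, max_tf) for t, tf in counts.items()}
    elif scheme == "lnc":
        raw = {t: ltf(tf) for t, tf in counts.items()}
    else:
        raise ValueError("unsupported scheme: " + scheme)
    return cosine_normalize(raw)
```

Because neither scheme uses document frequencies, a document can be indexed the moment it arrives, without reference to the rest of the collection.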

3.1 Distributed Routing

As pointed out in 2.1, we can use the same method of query term weighting as in a single collection environment. On the other hand, document indexing cannot be based on overall document frequencies. For this reason, we chose to use no collection frequency information at all for document indexing. This method is also very well suited to highly dynamic collections. For the experimental setting, the following steps were performed:

1. The training data set D1 was first indexed with the "anc" method, using only augmented term frequency and cosine normalization. Alternatively, we also applied the "lnc" method based on logarithmic term frequency in combination with cosine normalization.
2. The query set Q3 (queries 101 to 150) was weighted with the "atc" method using the product of augmented term frequency and inverse document frequency.
3. For internal verification of our methods, data set D2 was used by applying the same document indexing method as for D1.
4. Since the document indexing weights are independent of any specific collection, we were able to use a single database containing the whole TREC collection for our simulation experiments.
5. For learning from feedback data, we produced statistics of terms and phrases only from the relevant documents of D1 for the queries of Q3.
6. Using these statistics, we ran query expansion of Q3 using the standard Rocchio method (see e.g. [Salton & Buckley 90]). Let $\vec{q}$ be the original query vector, $\vec{r}$ the centroid of the relevant document vectors and $\vec{n}$ the centroid of the nonrelevant ones; then we computed an improved query vector according to the formula

\[ \vec{q}\,' = 8\,\vec{q} + 16\,\vec{r} - 4\,\vec{n}. \]

For query expansion, only a certain number of terms was finally considered in the computation of $\vec{q}\,'$. For this purpose, we used "percent expansion" with different parameters for words and phrases. That is, from all terms occurring at least once within a relevant document, only a certain percentage was considered in the final query; for this purpose, terms were ranked according to the number of relevant documents in which they occurred.
7. By using only relevant documents of D1 for expansion statistics, we tried to avoid the influence of terms from non-retrieved but nonrelevant documents in the occurrence statistics and in the expansion process.
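Step 6 can be illustrated with a small Python sketch of Rocchio expansion combined with percent expansion. It is a sketch under assumptions, not the SMART implementation: vectors are plain dictionaries, a single percentage is used (the experiments use separate percentages for words and phrases), original query terms are always kept, ties in the term ranking are broken alphabetically, and negative weights are left as they are.

```python
from collections import Counter
from typing import Dict, List

Vector = Dict[str, float]

def centroid(docs: List[Vector]) -> Vector:
    """Average a list of sparse document vectors."""
    if not docs:
        return {}
    total: Vector = {}
    for doc in docs:
        for term, weight in doc.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(docs) for term, weight in total.items()}

def rocchio_percent(query: Vector, relevant: List[Vector],
                    nonrelevant: List[Vector], percent: float,
                    alpha: float = 8.0, beta: float = 16.0,
                    gamma: float = 4.0) -> Vector:
    """Compute q' = alpha*q + beta*r - gamma*n over a restricted term set."""
    r, n = centroid(relevant), centroid(nonrelevant)
    # Rank candidate terms by the number of relevant documents containing them.
    doc_count = Counter(term for doc in relevant for term in doc)
    ranked = sorted(doc_count, key=lambda t: (-doc_count[t], t))
    keep = int(len(ranked) * percent / 100.0)
    # Assumption: the original query terms are always retained.
    allowed = set(query) | set(ranked[:keep])
    return {t: alpha * query.get(t, 0.0) + beta * r.get(t, 0.0)
               - gamma * n.get(t, 0.0)
            for t in allowed}

q = {"a": 1.0}
rel = [{"a": 0.5, "b": 0.5}, {"b": 1.0, "c": 0.2}]
nonrel = [{"a": 1.0}]
print(rocchio_percent(q, rel, nonrel, percent=50))
```

In the small example, three candidate terms occur in relevant documents, so a 50% expansion admits only the top-ranked one ("b") in addition to the original query term "a".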

Table 1 shows the experimental results for query set Q3 and document set D2. Clearly, document indexing method "lnc" gives better results than "anc". We also varied the weighting factors for phrases in order to optimize retrieval quality. For query expansion, percent expansion with different parameters was tested and compared with the case of a fixed number of expansion terms. The parameter combination of the last line in table 1 was used for the official run. For computing the query term weights, the combination of the document sets D1 and D2 was used. Running these queries on the test data set D3 produced the official run dortr1.

3.2 Distributed ad-hoc retrieval

For simulating distributed retrieval, we tested cases 2 and 3 from above. For this purpose, document sets D1 and D2 were split according to the 5 different sources, thus forming 5 separate databases. The documents in each database were indexed with the method "ltc", i.e. the product of logarithmic term frequency and inverse document frequency, followed by cosine normalization. The idf weight was computed from the local collection frequency only. Then, for each query from the set Q4, the following steps were performed:

1. For each database, the query was indexed with the "ltc" method (using local collection frequency only). With this query, the 1000 top-ranking documents were selected from each database.
2. With the 5 results, a new temporary document collection was formed. In a distributed environment, this document base would be constructed by the agent.
3. For testing case 2 from above, the temporary collection was reindexed with the "lnc" method, and the query from Q4 with "ltc" (based on frequency information from the temporary collection only). Retrieval with this combination produced the results for the official run dortd2.
4. For case 3, we assumed that we had collection frequency information for each query term from all servers. Thus, we could apply the "ltc" indexing method for both documents and queries based on frequency information from the whole collection. This way, the official run dortd1 was produced.
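The agent-side re-ranking of case 3 can be sketched as follows. This Python example is an illustration under assumptions and not the code used for the official runs: each server is assumed to report its number of documents and its local document frequencies, idf is taken as ln(N/df), and documents are represented by raw term counts.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

def global_idf(server_stats: List[Tuple[int, Dict[str, int]]]) -> Dict[str, float]:
    """Combine per-server (document count, local df) reports into global idf."""
    n_total = sum(n_docs for n_docs, _ in server_stats)
    df: Counter = Counter()
    for _, local_df in server_stats:
        df.update(local_df)  # adds the local document frequencies term by term
    # Assumption: idf_i = ln(N / df_i).
    return {t: math.log(n_total / f) for t, f in df.items() if f > 0}

def ltc(tf: Dict[str, int], idf: Dict[str, float]) -> Dict[str, float]:
    """Logarithmic tf times idf, followed by cosine normalization."""
    raw = {t: (1.0 + math.log(c)) * idf.get(t, 0.0) for t, c in tf.items() if c > 0}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm > 0 else raw

def rerank(query_tf: Dict[str, int], pooled: Dict[str, Dict[str, int]],
           idf: Dict[str, float]) -> List[Tuple[float, str]]:
    """Rank the pooled server answers by inner product with the query."""
    q = ltc(query_tf, idf)
    scored = []
    for doc_id, tf in pooled.items():
        d = ltc(tf, idf)
        scored.append((sum(w * d.get(t, 0.0) for t, w in q.items()), doc_id))
    return sorted(scored, reverse=True)
```

For case 2, the same `rerank` function could be called with idf values computed from the pooled documents alone, so the two runs would differ only in where the frequency information comes from.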

index.   word   phrase   word   phrase   11-pt
method   exp.   exp.     wt.    wt.      average
anc      20%    8%       1.0    0.5      0.4051
anc      16%    8%       1.0    0.5      0.4098
anc      14%    7%       1.0    0.5      0.4121
anc      13%    6%       1.0    0.5      0.4125
anc      14%    6%       1.0    0.5      0.4127
anc      300    50       1.0    0.5      0.4130
anc      350    50       1.0    0.5      0.4141
anc      14%    7%       1.0    0.7      0.4147
anc      14%    6%       1.0    0.7      0.4149
anc      14%    5%       1.0    0.7      0.4150
anc      14%    4%       1.0    0.7      0.4155
anc      14%    3%       1.0    0.7      0.4155
anc      13%    6%       1.0    0.7      0.4157
anc      14%    6%       1.0    0.8      0.4146
anc      14%    7%       1.0    1.0      0.4120
lnc      14%    4%       1.0    0.5      0.4275
lnc      350    50       1.0    0.5      0.4275
lnc      13%    4%       1.0    0.5      0.4278
lnc      12%    4%       1.0    0.5      0.4279
lnc      11%    4%       1.0    0.5      0.4283
lnc      14%    7%       1.0    0.7      0.4259
lnc      14%    6%       1.0    0.7      0.4268
lnc      16%    3%       1.0    0.7      0.4274
lnc      15%    4%       1.0    0.7      0.4274
lnc      13%    6%       1.0    0.7      0.4274
lnc      350    50       1.0    0.7      0.4277
lnc      15%    3%       1.0    0.7      0.4278
lnc      14%    5%       1.0    0.7      0.4278
lnc      300    50       1.0    0.7      0.4280
lnc      14%    4%       1.0    0.7      0.4282
lnc      14%    3%       1.0    0.7      0.4284
lnc      13%    4%       1.0    0.7      0.4284
lnc      12%    4%       1.0    0.7      0.4291

Table 1: Results of learning runs for routing (learning with D1, testing with D2)

4 Results

4.1 Routing results

The results of the official runs show that our routing run is clearly above the average of all runs. Thus, we can conclude that our approach, although it is very simple, works and yields good results.

4.2 Ad-hoc results

Both ad-hoc runs produced results of average quality. Considering the constraints underlying our approach, and that no tuning or learning was performed, this is a positive result. Comparing the two approaches, there seems to be no significant difference. Thus, we can conclude that it does not matter whether the ranking performed in the agent uses global term frequency information or local information only. So the overhead of transmitting this frequency information to the agent can be saved. It seems that the retrieval quality is affected mainly by the indexing functions used in the servers.

5 Conclusions and Outlook

The results of our experiments show that retrieval in distributed environments can be performed without losing too much in terms of retrieval quality. Here we have only considered very simple weighting schemes. By using more sophisticated schemes, and especially by applying appropriate learning methods, much better results can be expected. With the increase of network connectivity and the growing number of document bases accessible, the development of retrieval methods for even large numbers of loosely coupled IR systems is becoming a major research issue.

References

Kahle, B.; Morris, H.; Goldman, J.; Erickson, T.; Curran, J. (1993). Interfaces for Distributed Systems of Information Servers. Journal of the American Society for Information Science 44(8), pages 453-467.

Pfeifer, U.; Fuhr, N.; Huynh, T.T. (1995). Searching Structured Documents with the Enhanced Retrieval Functionality of freewaissf and SFgate. To appear in: Proc. of the 3rd World Wide Web Conference '95. Also available from http://ls6-www.informatik.unidortmund.de/~pfeifer/fwsf.html.

Salton, G.; Buckley, C. (1988). Term Weighting Approaches in Automatic Text Retrieval. Information Processing and Management 24(5), pages 513-523.

Salton, G.; Buckley, C. (1990). Improving Retrieval Performance by Relevance Feedback. Journal of the American Society for Information Science 41(4), pages 288-297.