A Methodology for Collection Selection in Heterogeneous Contexts

Similar documents
Approaches to Collection Selection and Results Merging for Distributed Information Retrieval

Obtaining Language Models of Web Collections Using Query-Based Sampling Techniques

Federated Search. Jaime Arguello INLS 509: Information Retrieval November 21, Thursday, November 17, 16

Federated Text Search

CS47300: Web Information Search and Management

Combining CORI and the decision-theoretic approach for advanced resource selection

Collection Selection and Results Merging with Topically Organized U.S. Patents and TREC Data

CS54701: Information Retrieval

ABSTRACT. Categories & Subject Descriptors: H.3.3 [Information Search and Retrieval]: General Terms: Algorithms Keywords: Resource Selection

DATABASE MERGING STRATEGY BASED ON LOGISTIC REGRESSION

Term Frequency Normalisation Tuning for BM25 and DFR Models

Result merging strategies for a current news metasearcher

number of documents in global result list

Making Retrieval Faster Through Document Clustering

Evaluation of Meta-Search Engine Merge Algorithms

ResPubliQA 2010

Report on the TREC-4 Experiment: Combining Probabilistic and Vector-Space Schemes

Melbourne University at the 2006 Terabyte Track

Frontiers in Web Data Management

TREC-10 Web Track Experiments at MSRA

RMIT University at TREC 2006: Terabyte Track

Federated Text Retrieval From Uncooperative Overlapped Collections

An Axiomatic Approach to IR UIUC TREC 2005 Robust Track Experiments

From Passages into Elements in XML Retrieval

A Practical Passage-based Approach for Chinese Document Retrieval

A Formal Approach to Score Normalization for Meta-search

Robust Relevance-Based Language Models

Examining the Authority and Ranking Effects as the result list depth used in data fusion is varied

Learning Collection Fusion Strategies for Information Retrieval

Study on Merging Multiple Results from Information Retrieval System

Query Likelihood with Negative Query Generation

Hierarchical Location and Topic Based Query Expansion

A Meta-search Method with Clustering and Term Correlation

Noisy Text Clustering

IITH at CLEF 2017: Finding Relevant Tweets for Cultural Events

Methods for Information Server Selection

GlOSS: Text-Source Discovery over the Internet

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Exploiting Index Pruning Methods for Clustering XML Collections

Reducing Redundancy with Anchor Text and Spam Priors

[31] T. W. Yan, and H. Garcia-Molina. SIFT - A Tool for Wide-Area Information Dissemination. USENIX

Distributed similarity search algorithm in distributed heterogeneous multimedia databases

Capturing Collection Size for Distributed Non-Cooperative Retrieval

Fondazione Ugo Bordoni at TREC 2004

University of Delaware at Diversity Task of Web Track 2010

An Attempt to Identify Weakest and Strongest Queries

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS

Tilburg University. Authoritative re-ranking of search results Bogers, A.M.; van den Bosch, A. Published in: Advances in Information Retrieval

Verbose Query Reduction by Learning to Rank for Social Book Search Track

Weighted Suffix Tree Document Model for Web Documents Clustering

A Study of Collection-based Features for Adapting the Balance Parameter in Pseudo Relevance Feedback

TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback

Methods for Distributed Information Retrieval Nicholas Eric Craswell

Static Index Pruning for Information Retrieval Systems: A Posting-Based Approach

A Topic-based Measure of Resource Description Quality for Distributed Information Retrieval

Implementing a customised meta-search interface for user query personalisation

A User Profiles Acquiring Approach Using Pseudo-Relevance Feedback

Query Expansion with the Minimum User Feedback by Transductive Learning

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches

Distributed Information Retrieval

TREC-7 Experiments at the University of Maryland Douglas W. Oard Digital Library Research Group College of Library and Information Services University

Automatically Generating Queries for Prior Art Search

Improving Difficult Queries by Leveraging Clusters in Term Graph

CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL

Partial Collection Replication For Information Retrieval

An Investigation of Basic Retrieval Models for the Dynamic Domain Task

A NEW PERFORMANCE EVALUATION TECHNIQUE FOR WEB INFORMATION RETRIEVAL SYSTEMS

Maximal Termsets as a Query Structuring Mechanism

Keywords APSE: Advanced Preferred Search Engine, Google Android Platform, Search Engine, Click-through data, Location and Content Concepts.

Evaluating Relevance Ranking Strategies for MEDLINE Retrieval

Inferring Variable Labels Considering Co-occurrence of Variable Labels in Data Jackets

RSDC 09: Tag Recommendation Using Keywords and Association Rules

UMass at TREC 2017 Common Core Track

Huffman-DHT: Index Structure Refinement Scheme for P2P Information Retrieval

Automatic Structured Query Transformation Over Distributed Digital Libraries

Term-Specific Smoothing for the Language Modeling Approach to Information Retrieval: The Importance of a Query Term

Evaluatation of Integration algorithms for Meta-Search Engine

Siemens TREC-4 Report: Further Experiments with Database. Merging. Ellen M. Voorhees. Siemens Corporate Research, Inc.

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

indexing and query processing. The inverted file was constructed for the retrieval target collection which contains full texts of two years' Japanese pa

Collection Selection with Highly Discriminative Keys

Finding Topic-centric Identified Experts based on Full Text Analysis

Leveraging Transitive Relations for Crowdsourced Joins*

On Duplicate Results in a Search Session

Northeastern University in TREC 2009 Million Query Track

IRCE at the NTCIR-12 IMine-2 Task

University of Amsterdam at INEX 2010: Ad hoc and Book Tracks

Merging algorithms for enterprise search

Patent Classification Using Ontology-Based Patent Network Analysis

Report on the CLEF-IP 2012 Experiments: Search of Topically Organized Patents

Inter and Intra-Document Contexts Applied in Polyrepresentation

Content-based search in peer-to-peer networks

External Query Reformulation for Text-based Image Retrieval

A Unified User Profile Framework for Query Disambiguation and Personalization

Focused Retrieval Using Topical Language and Structure

Indri at TREC 2005: Terabyte Track (Notebook Version)

A Cluster-Based Resampling Method for Pseudo- Relevance Feedback

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!

Document Structure Analysis in Associative Patent Retrieval

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY

Transcription:

A Methodology for Collection Selection in Heterogeneous Contexts

Faïza Abbaci, Ecole des Mines de Saint-Etienne, 158 Cours Fauriel, 42023 Saint-Etienne, France, abbaci@emse.fr
Jacques Savoy, Université de Neuchâtel, Pierre-à-Mazel 7, 2000 Neuchâtel, Suisse, Jacques.Savoy@unine.ch
Michel Beigbeder, Ecole des Mines de Saint-Etienne, 158 Cours Fauriel, 42023 Saint-Etienne, France, mbeig@emse.fr

Abstract

In this paper we demonstrate that in an ideal Distributed Information Retrieval environment, taking into account the ability of each collection server to return relevant documents when selecting collections can be effective. Based on this assumption, we suggest a new approach to resolving the collection selection problem. In order to predict a collection's ability to return relevant documents, we inspect a limited number n of documents retrieved from each collection and analyze the proximity of the search keywords within them. In our experiments, we vary the underlying parameter n of our suggested model to define the most appropriate number of top documents to be inspected. Moreover, we evaluate the retrieval effectiveness of our approach and compare it with both the centralized indexing and the CORI approaches [1], [16]. Preliminary results from these experiments, conducted on the WT10g test collection, tend to demonstrate that our suggested method can achieve appreciable retrieval effectiveness.

Keywords: Information retrieval, distributed information retrieval, collection selection, results merging strategy, evaluation.

1: Introduction

Distributing storage and search processing seems to be an appropriate solution for overcoming several limitations inherent in Centralized Information Retrieval (CIR) systems [7], particularly those due to the exponential growth of the information available on the Internet. Distributed Information Retrieval (DIR) system architecture in its simplest form consists of various collection servers and a broker. Typically, the broker receives the user's query and forwards it to a carefully selected subset of collection servers most likely to contain relevant documents for this query (e.g., based on search keywords, collection statistics, query language or a subset of servers pre-selected by the user). Finally, the broker combines the individual result lists (submitted by each selected collection) in order to produce a single ranked list. While most previous studies found that carrying out collection selection within DIR systems decreases effectiveness [16], some recent studies tend to demonstrate that when a good selection is provided, retrieval effectiveness in DIR systems has the potential of being just as effective as single CIR systems [12], [17]. In this paper, we investigate how to select a subset of collection servers most likely to be relevant to a given request, and thus obtain improved retrieval performance. In this vein, there are three main differences between our collection selection approach and various other selection methods. Firstly, our approach does not use any pre-stored metadata to predict a collection's relevance to the query. Secondly, it does not require collection ranking: each collection is selected independently from the others. Thirdly, our approach does not require any collection or server cooperation. The rest of this paper is organized as follows. The next section describes previous works that attempt to resolve the collection selection problem. Section 3 illustrates the testbed we used in our experiments.
Section 4 demonstrates that a collection's ability to return relevant documents can serve as a good criterion for defining a selection procedure. Our suggested approach is presented in Section 5. Section 6 discusses in detail the evaluations we carried out on the WT10g test collection, and compares our strategy's performance with that of other selection schemes.

2: Collection selection

Collection selection can be performed automatically or manually, and for a given query it consists of selecting the collections likely to contain relevant documents. Obviously, ignoring this step and sending the submitted request to all known collections is one possible solution (an approach usually chosen by novice users); we will denote this approach as NS (for No-Selection). This method is however expensive in terms of resources, and it also increases user latency. When a proper collection selection is made, it is possible to achieve results superior to those of the NS approach.

Thus, the goal of collection selection is to reduce the number of collections searched as much as possible, without decreasing retrieval effectiveness [5], [7]. In automatically selecting a subset of servers, most collection selection techniques compute a score for each collection, based on its usefulness to the submitted query. The collections are then ranked according to these scores, and the system might select either the N highest ranking collections or those collections with a score exceeding a given threshold. In order to calculate these collection scores, however, the broker must have access to some information about each collection's contents, and the various suggested approaches differ in the nature of this global information or in the manner in which it is acquired. Previous works have included collection descriptions [2] or collection statistics (frequency, co-occurrence, etc.) [6], [10], [19] in their collection selection methods, techniques that require a certain degree of collection cooperation. Xu & Croft [17] suggested selecting documents according to their topics, with a language model associated with each of these topics. Callan et al. [1] and Xu & Callan [16] described the Collection Retrieval Inference network (CORI), in which each collection is considered as a single gigantic document. Ranking these collection documents uses methods similar to those employed in conventional information retrieval systems. Larkey et al. [8] found that CORI works well when collections are organized topically. Several methods were developed by Zobel [19] to calculate collection scores, while Moffat & Zobel [10] suggested decomposing each collection into blocks of documents, where the blocks were indexed by the broker. The resulting index was then used to find blocks having high-ranking scores with respect to the query, and collections were then selected to match these blocks. The GlOSS system [6] ranked collections according to their appropriateness to the submitted query, estimating the number of documents in each collection whose query similarity was greater than a predefined threshold, and creating a collection score by summing these similarities. For a given query, Hawking & Thistlewaite [7] proposed broadcasting a probe query (containing one to three terms) to all available collections, each of which responds with term information that is then used to calculate the collection score. Towell et al. [15] developed a learning method in which they determined the optimum number of documents to be retrieved from each collection, rather than defining the number of collections to be searched. In their calculation of collection scores, Craswell et al. [4] included the search engine's retrieval effectiveness for each collection. Ogilvie & Callan [11] tested the use of query expansion to resolve the collection selection problem; however, their results were not conclusive. Yu et al. [18] ranked collections by incorporating information on linkages between documents in a Web environment. We believe CORI to be a suitable representative of the above strategies, and will describe its selection procedure in more detail. We will then evaluate it by comparing it with the approach used in our experiments. The CORI approach uses an inference network to rank collections.
For the ith collection and a given query q, the collection score is computed as:

$$ s_i = \frac{1}{m} \sum_{j=1}^{m} s(t_j \mid C_i) $$

where $s(t_j \mid C_i)$ indicates the contribution of the search term $t_j$ to the score of collection $C_i$, calculated as follows:

$$ s(t_j \mid C_i) = defb + (1 - defb) \cdot \frac{df_i}{df_i + K} \cdot \frac{\log\left(\frac{|C| + 0.5}{cf_j}\right)}{\log\left(|C| + 1.0\right)} \qquad \text{with} \qquad K = k \cdot \left((1 - b) + b \cdot \frac{lc_i}{avlc}\right) $$

where
- m is the number of terms included in query q,
- |C| is the number of collections,
- df_i is the number of documents in collection C_i containing the jth query term,
- cf_j is the number of collections containing the query term t_j,
- lc_i is the number of indexing terms in C_i,
- avlc is the average number of indexing terms per collection, and
- defb, b and k are constants; as suggested by Xu & Callan [16], they were set to defb = 0.4, k = 200 and b = 0.75.

After ranking the collections according to their scores, one possibility is to select the N top-ranked collections, where N can be determined by the user. Another possibility is to use an algorithm to cluster the collection scores and then select those collections in the top clusters [1]. We evaluated the latter case using a cluster difference threshold α = 0.0002.
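To make the computation concrete, the following Python sketch (ours, not part of the original paper) implements the CORI collection score exactly as defined above; the collection statistics passed in (document frequencies, collection frequencies, lexicon sizes) are assumed to be available to the broker.

```python
import math

# Constants as suggested by Xu & Callan [16]
DEFB, K_CONST, B = 0.4, 200, 0.75

def cori_score(query_terms, coll, num_collections, cf, avlc):
    """CORI score s_i of one collection for a query.

    coll: dict with keys
        'df' -> {term: number of documents in this collection containing term}
        'lc' -> number of indexing terms in this collection
    cf: {term: number of collections containing term}
    avlc: average number of indexing terms per collection
    """
    K = K_CONST * ((1 - B) + B * coll['lc'] / avlc)
    total = 0.0
    for t in query_terms:
        df = coll['df'].get(t, 0)
        cf_t = cf.get(t, 0)
        t_part = df / (df + K)
        # belief component; guard the degenerate case of a term unknown to every collection
        i_part = (math.log((num_collections + 0.5) / cf_t) /
                  math.log(num_collections + 1.0)) if cf_t > 0 else 0.0
        total += DEFB + (1 - DEFB) * t_part * i_part
    return total / len(query_terms)   # average over the m query terms
```

Collections would then be ranked by this score, and either the N best collections kept or the scores clustered and the top clusters selected, as described above.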

3: Our testbed

Our experiments were conducted with the WT10g test collection, containing 1,692,096 Web pages from sites around the world (total size = 11,033 MB), which was used for the Web track of the TREC9 conference. The queries used in our experiments were built from the topic titles only, and thus correspond to real-life requests sent by users to the Excite search engine. They cover a variety of topics (e.g., "Parkinson's disease", "hunger", "Baltimore", "how e-mail benefits businesses" or "Mexican food culture"), and query lengths were rather short (a mean of 2.4 words and a standard deviation of 0.6). In order to simulate distributed collections, we divided the testbed into eight collections, each having roughly the same number of documents and the same size. Table 1 depicts various statistics about these collections, including size, number of documents, and the number of queries having at least one relevant item. For some requests no relevant documents were returned by the collections; for example, Queries #464 and #487 did not return any documents from any of the eight collections, due to spelling errors (Query #464 was written as "nativityscenes" and Query #487 as "angioplast7"). All collections were indexed by the SMART system, using the Okapi [14] probabilistic search model (see Appendix 2 for details).

Collection   Size (MB)   # documents   # queries with relevant items
TREC9.1      1,325       207,485       28
TREC9.2      1,474       207,429       44
TREC9.3      1,438       221,916       38
TREC9.4      1,316       202,049       21
TREC9.5      1,309       203,073       22
TREC9.6      1,311       215,451       24
TREC9.7      1,336       200,146       25
TREC9.8      1,524       234,547       43

Table 1: Summary of our testbed statistics

4: Which is the best collection selection?

Some recent experiments have shown that DIR systems can outperform CIR systems in terms of average precision, if a good selection is provided [12], [17]. A good selection would be described as one in which noise (the set of irrelevant documents returned by the search engine) and silence (the set of relevant documents not returned by the retrieval system) have decreased. The question is: how do you make a good selection? To achieve this objective of decreasing noise and silence, we might choose collections containing at least one relevant document. Thus, we would decrease noise by eliminating collections without any relevant items, and decrease silence by maximizing the number of relevant documents that the system may retrieve. We will denote such a selection approach as Optimal1. On the other hand, there may be collections that contain relevant documents but use an ineffective retrieval strategy. This means placing relevant documents at the end of the returned list, where they have little chance of being consulted by the user. We believe that such a collection should be eliminated. Thus, a good selection strategy would select collections that contain relevant items and are able to return them among the n top documents. We will denote such a selection approach as Optimal2(n). Before presenting our selection approach, we want to verify the following hypotheses:

1. Optimal1 produces better results than both the NS and centralized (labeled Single in our tables) approaches.
2. Optimal2(n) results in better retrieval performance than both Optimal1 and the centralized approach.

In order to verify the second assumption, we need some idea of the best value of the underlying parameter n. In order to verify these assumptions, and in order to process the Optimal1 and Optimal2(n) selection procedures, we used our test collections and our knowledge about their relevant items. It should be noted that the centralized approach retrieves documents from a single database (labeled Single in our tables), and that the NS approach selects all collections. The Optimal1 selection procedure selects all collections containing at least one relevant document, and the Optimal2(n) procedure selects all collections containing at least one relevant document among their n top documents.
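As an illustration (not from the paper), the two oracle procedures can be written down directly; the sketch below assumes access to the relevance judgments (qrels) and to each collection's ranked result list, with hypothetical function names.

```python
def optimal1(collections, relevant_docs):
    """Select every collection holding at least one relevant document.

    collections: {name: set of document ids stored in that collection}
    relevant_docs: set of document ids judged relevant for the query
    """
    return [name for name, docs in collections.items() if docs & relevant_docs]

def optimal2(result_lists, relevant_docs, n):
    """Select every collection that returns at least one relevant
    document among its n top-ranked results.

    result_lists: {name: ranked list of document ids returned for the query}
    """
    return [name for name, ranking in result_lists.items()
            if any(doc in relevant_docs for doc in ranking[:n])]
```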
To produce a single result list from the various result lists provided by the collection servers, we adopted the raw-score merging procedure, which merges the individual result lists based on document scores [1], [5]. Table 2 shows the average precision achieved by the Single, NS and Optimal1 selection procedures. Shown in parentheses in Tables 2 and 3 are the percent changes in average precision compared to the Single approach. As illustrated, Optimal1 provides better retrieval performance than either the single approach (+7.7%) or the NS model (+9.2%, when comparing average precision: 21.11 vs. 19.32). Our first assumption is therefore confirmed.

Selection approach   Average precision (48 queries)
Single               19.60%
NS                   19.32% (-1.4%)
Optimal1             21.11% (+7.7%)

Table 2: Average precision for three selection procedures

For our second assumption, we varied the value of n from 1 to 5, as shown in Table 3. For reasons of economy, we did not test Optimal2(n) for n greater than 5. Note that for some topics no collections were selected; we therefore only report results for queries for which at least one collection was selected. Table 3 shows that for the WT10g corpus and for n > 1, Optimal2(n) manifests better retrieval performance than Optimal1, and is therefore better than the single approach. As such, our second assumption is also confirmed.

n    # queries   Single   Optimal1        Optimal2(n)
1    32          28.64    30.44 (+6.3%)   29.29 (+2.3%)
2    32          28.64    30.44 (+6.3%)   30.99 (+8.2%)
3    35          25.01    26.30 (+5.2%)   28.26 (+13.0%)
4    38          24.62    26.40 (+7.2%)   27.27 (+10.8%)
5    39          24.02    25.79 (+7.4%)   27.17 (+13.1%)

Table 3: Average precision for the centralized, Optimal1 and Optimal2(n) selection approaches

Now that our two assumptions have been established, it can be deduced that selecting collections containing at least one relevant document is a good selection strategy. An even better approach is selecting all collections able to return at least one relevant document among the top n documents. Based on these conclusions, the following section will propose our collection selection procedure.
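For completeness, here is a minimal sketch (our own, not from the paper) of the raw-score merging step mentioned at the beginning of this comparison: the result lists returned by the selected collections are simply pooled and re-sorted by the scores the servers reported, assuming those scores are comparable across collections.

```python
def raw_score_merge(result_lists, k=1000):
    """Merge per-collection result lists into one ranked list by raw score.

    result_lists: {collection_name: [(doc_id, score), ...]}; scores are
    assumed comparable across collections (the raw-score assumption).
    """
    merged = [(doc_id, score, name)
              for name, ranking in result_lists.items()
              for doc_id, score in ranking]
    merged.sort(key=lambda item: item[1], reverse=True)
    return merged[:k]
```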

5: Our selection procedure

The previous experiments showed that selecting collections able to return at least one relevant document among the top n retrieved documents is a good approach. The question must then be asked: how can we know whether a collection is able to return relevant documents or not? In practice, to achieve this objective we estimate the relevance of the first n documents returned by each collection and then select the collections having returned at least one presumably relevant document. This selection procedure will be denoted TRD-CS (using Top Ranked Documents for Collection Selection), an approach differing from its predecessors in that it does not assign scores to each collection. It does bear a slight resemblance to the approach developed by Hawking & Thistlewaite [7], because both assume that no information is available a priori to perform the selection, the required information being obtained during query processing. However, Hawking & Thistlewaite's [7] proposal is unrealistic in the sense that it relies on the widespread adoption of protocols allowing servers to communicate statistics or metadata about their holdings. On the other hand, our method is related to that of Craswell et al. [4] because it takes server effectiveness into account. The main guidelines of our approach can be summarized as follows: the broker broadcasts a query to all available servers (denoted |C|), each of which returns n ranked documents to the broker. The broker then calculates a score for each received document (i.e., for n · |C| items) and sorts them according to these scores. Finally, the collections matching the n_first top-scoring documents are selected.

When we define a document's score, we assume that the following hints can be used as good relevance indicators: the number of search keywords included in each document surrogate, the distance between two query terms in the document, and their occurrence frequencies in the document. From this perspective, and inspired by Lawrence & Lee Giles [9], we calculate document scores as follows:

$$ score(d, q) = (c_1 \cdot nb_d) + (c_2 \cdot dis\_ind(d, q)) + (c_3 \cdot nb\_occ) $$

where for each document d:
- nb_d is the number of search keywords included in the document d,
- nb_occ is the total number of occurrences of query terms in d,
- dis_ind is the indicator of the distance between two query terms in d; this function returns a real value greater than or equal to zero,
- c_1, c_2, c_3 are constants, set to c_1 = 100, c_2 = 1000, c_3 = 1000 in our experiments.

According to the formula introduced by Clarke et al. [3], and assuming that the first two query terms are the most important search keywords, we compute dis_ind only for these two terms, as follows:

$$ dis\_ind(d, q) = \sum_{i} dis(k, l)_i $$

where
- k and l are search keyword positions within the document d delimiting the ith block,
- dis(k,l)_i is the score of this block in document d, where the block satisfies the query q (i.e., the block contains the first two query terms in our case) and does not include any smaller block satisfying the query (only the block having the smallest size is selected).
For example, consider a given query consisting of two terms t_i and t_j. If t_i appears in the 5th and 25th positions and t_j in the 27th position, we may find a first block (k = 5 and l = 27) and a second block (k = 25 and l = 27). As the first block contains the second, the first block is ignored and dis_ind is therefore reduced to dis_ind(d,q) = dis(25,27) = 0.5. More formally, dis(k,l)_i is defined as:

$$ dis(k, l)_i = \begin{cases} \dfrac{1}{|(k, l)_i|} & \text{if } |(k, l)_i| > 1 \\[4pt] 1 & \text{if } |(k, l)_i| \le 1 \end{cases} $$

where |(k,l)_i| denotes the size of the ith block (the distance between the positions k and l). In the case of a mono-term query, and according to [9], dis_ind represents the inverse of the distance from the start of the document to the first occurrence of this unique search keyword. Finally, a document obtains a zero score if it does not contain any query terms.
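The following Python sketch (our illustration, with hypothetical helper names) puts the pieces together: the broker scores the top n documents returned by every server with the formula above and keeps the collections that placed a document among the n_first best-scored items.

```python
C1, C2, C3 = 100, 1000, 1000   # constants used in the paper's experiments

def dis_ind(pos_a, pos_b):
    """Distance indicator computed on the first two query terms.

    pos_a, pos_b: sorted word positions of the two terms in the document.
    Sums 1/size over the minimal blocks containing both terms (1 when the
    block size is <= 1); for a mono-term case it is the inverse of the
    position of the first occurrence; 0 if neither term is present.
    """
    if not pos_a and not pos_b:
        return 0.0
    if not pos_a or not pos_b:
        return 1.0 / max((pos_a or pos_b)[0], 1)
    spans = {(min(a, b), max(a, b)) for a in pos_a for b in pos_b}
    minimal = [s for s in spans
               if not any(o != s and o[0] >= s[0] and o[1] <= s[1] for o in spans)]
    return sum(1.0 / (hi - lo) if hi - lo > 1 else 1.0 for lo, hi in minimal)

def score_document(doc_positions, query_terms):
    """score(d, q) = c1*nb_d + c2*dis_ind(d, q) + c3*nb_occ.

    doc_positions: {term: [word positions of the term in the document]}.
    """
    nb_d = sum(1 for t in query_terms if doc_positions.get(t))
    if nb_d == 0:
        return 0.0                        # no query term at all -> zero score
    nb_occ = sum(len(doc_positions.get(t, [])) for t in query_terms)
    t1, t2 = (list(query_terms) + [None])[:2]   # the first two query terms
    d = dis_ind(doc_positions.get(t1, []),
                doc_positions.get(t2, []) if t2 else [])
    return C1 * nb_d + C2 * d + C3 * nb_occ

def trd_cs_select(server_results, query_terms, n, n_first):
    """TRD-CS: score the top n documents of every server, sort the n*|C|
    scored items, and select the collections owning the n_first best.

    server_results: {collection: [(doc_id, {term: positions}), ...]}
    """
    scored = [(score_document(positions, query_terms), name)
              for name, docs in server_results.items()
              for _doc_id, positions in docs[:n]]
    scored.sort(reverse=True)
    return {name for _score, name in scored[:n_first]}
```

On the paper's example (t_i at positions 5 and 25, t_j at position 27), `dis_ind` keeps only the block (25, 27) and returns 0.5, as in the text.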

6: Evaluation

In our evaluations, and for comparative purposes, we refer to the centralized approach as a baseline. As the merging procedure we used the LMS merging strategy (using result Length to define the Merging Score) [13], where the length of the result lists is chosen for merging the retrieved items because it results in the most effective retrieval when used with various selection approaches [13]. Table 4 depicts the average precision achieved by the single, CORI, NS and our own selection procedures. The second column lists the average precision achieved by the various approaches, and the following columns the precision after retrieving 5, 10, 15, 20, 30 and 100 documents. We believe these performance measures are useful because the typical user will inspect only the top retrieved items. Finally, it is important for us to know the mean number of selected collections, and these values are depicted in the last column. In our testbed, an average of 4.9 collections out of a maximum of 8 should be selected for each query; therefore a value near 4.9 in the last column can be viewed as good selection performance.

Approach                       Avg. prec.   P@5     P@10    P@15    P@20    P@30    P@100   Mean # selected coll.
Single                         19.60        26.67   22.29   19.17   17.40   15.97   10.98   -
CORI                           19.22        27.92   22.50   19.58   17.71   15.90   10.65   7.52
NS                             19.32        27.92   22.08   19.44   17.60   15.76   10.75   8
TRD-CS, n = 1 (n_first = 6)    18.93        28.33   22.29   20.56   18.85   16.53   10.10   6.00
TRD-CS, n = 2 (n_first = 9)    19.00        28.33   23.13   20.69   18.75   16.04   10.12   6.02
TRD-CS, n = 3 (n_first = 9)    19.00        28.33   23.54   20.56   18.96   16.32   9.90    5.34
TRD-CS, n = 4 (n_first = 12)   19.20        28.33   23.33   19.86   18.44   15.97   10.02   5.82
TRD-CS, n = 5 (n_first = 13)   18.84        27.92   22.71   19.72   18.33   15.83   10.19   5.96

Table 4: Comparison of average precision achieved by various selection approaches

The results depicted in Table 4 were obtained by varying the parameter n (the number of items inspected from each collection) as well as the parameter n_first (the number of documents used as a basis for selecting the underlying collections). The best retrieval performance for our selection approach is given in Appendix 1. Table 4 uses the following typographical conventions to present the results of our evaluation: results in bold are significantly better than the single approach (a difference of 5% in average performance is considered significant), results in italics represent performance significantly inferior to the single model, and regular type denotes no significant difference in retrieval effectiveness.

From the data shown in Table 4, we can infer the following conclusions. In reviewing the average precision shown in the second column, none of the retrieval models proves to have significantly better or worse performance than the single approach. As for the precision achieved after retrieving 5, 15 or 20 documents, our selection approach (whatever the value of the parameters n or n_first) usually results in significantly better retrieval effectiveness than the single approach. In such cases our selection approach also shows improved retrieval performance compared with the CORI or NS models. When reviewing the precision achieved after retrieving 100 documents, however, the results for our selection model degrade slightly; Web users will typically not inspect result lists beyond the first 20 retrieved items. Overall, when comparing the results obtained by our selection scheme for different values of n, it seems that the best value for this parameter is around n = 3. For our second parameter, the best choice seems to be around n_first = 10 (see also Appendix 1 for the best parameter values obtained when using only average precision). When the value of the parameter n is increased beyond 3, Table 4 indicates a decrease in retrieval performance for our model.

7: Conclusion

In this paper we discussed responses to the following question: what should be considered a good selection in an ideally distributed environment, where knowledge of the entire set of relevant documents is available? To respond to this question we introduced two methods, denoted Optimal1 and Optimal2(n).
Through experiments conducted on the WT10g test collection, we demonstrated that these two selection approaches provide better retrieval than the centralized approach. Moreover, the Optimal2(n) selection procedure represents the better choice because it takes into account the ability of the collection servers to return relevant items. We then introduced our own selection method; based on the previous results, we confirmed that good collection selection can result when we select those collections able to return at least one relevant document at the top of their response lists. From a practical point of view, our selection strategy relies on inspecting the top-ranked documents returned by each collection in order to judge the usefulness of a collection server. Our approach does not require the creation of any pre-stored metadata, and as such it does not need any updates to reflect changes in a collection's content. Also, our selection scheme will eliminate those collections that contain relevant documents but are unable to place them among their top retrieved documents. Our experiments were conducted using very short queries, similar to those submitted to Web search engines, and may therefore be considered realistic.

However, our selection strategy does require more transfer traffic in order to download the first n documents per collection, and thus response time may increase slightly. The optimal value of n seems, however, to be relatively small (around 3), meaning our selection approach did not cause very large downloading delays. Our evaluations also show that our selection procedure returns a reasonable number of collections, with a mean of 72.8% of collections being selected, compared to 94% for the CORI approach. The investigation described in this paper used the same search engine on each collection server, a context more closely reflecting that of a digital library environment, in which all resources are managed by the same search engine. Our current work will also consider the use of different collections, indexed and searched by various search engines.

Acknowledgements

The authors would like to thank C. Buckley from SabIR for giving us the opportunity to use the SMART system, without which this study could not have been conducted. We would also like to thank Yves Rasolofo from the University of Neuchatel for providing us with the LMS merging program. This material is based on work supported in part by Région Rhône-Alpes (Eurodoc grant for F. Abbaci) and by the SNSF (Swiss National Science Foundation, under grant #21-58 813.99, J. Savoy).

References

[1] Callan, J.P., Lu, Z., Croft, W.B.: Searching distributed collections with inference networks. Proceedings of ACM-SIGIR'1995, pp. 21-28.
[2] Chakravarthy, A.S., Haase, K.B.: NetSerf: Using semantic knowledge to find internet information archives. Proceedings of ACM-SIGIR 1995, pp. 4-11.
[3] Clarke, C.L.A., Cormack, G.V., Burkowski, F.J.: Shortest substring ranking (MultiText experiments for TREC-4). Proceedings of TREC-4, 1995, pp. 295-304.
[4] Craswell, N., Bailey, P., Hawking, D.: Server selection in the world wide web. Proceedings of ACM-DL'2000, pp. 37-46.
[5] French, J.C., Powell, A.L., Callan, J., Viles, C.L., Emmitt, T., Prey, K.J., Mou, Y.: Comparing the performance of database selection algorithms. Proceedings of ACM-SIGIR 1999, pp. 238-245.
[6] Gravano, L., Garcia-Molina, H., Tomasic, A.: GlOSS: Text-source discovery over the Internet. ACM Transactions on Database Systems, 24(2), 1999, pp. 229-264.
[7] Hawking, D., Thistlewaite, P.: Methods for information server selection. ACM Transactions on Information Systems, 17(1), 1999, pp. 40-76.
[8] Larkey, L.S., Connell, M., Callan, J.: Collection selection and results merging with topically organized U.S. patents and TREC data. Proceedings of ACM-CIKM'2000, pp. 282-289.
[9] Lawrence, S., Lee Giles, C.: Inquirus, the NECI meta search engine. Proceedings of WWW'7, 1998, pp. 95-105.
[10] Moffat, A., Zobel, J.: Information retrieval systems for large document collections. Proceedings of TREC-3, 1995, pp. 85-94.
[11] Ogilvie, P., Callan, J.: The effectiveness of query expansion for distributed information retrieval. Proceedings of ACM-CIKM'2001, to appear.
[12] Powell, A.L., French, J.C., Callan, J., Connell, M., Viles, C.L.: The impact of database selection on distributed searching. Proceedings of ACM-SIGIR'2000, pp. 232-239.
[13] Rasolofo, Y., Abbaci, F., Savoy, J.: Approaches to collection selection and results merging for distributed information retrieval. Proceedings of ACM-CIKM'2001, to appear.
[14] Robertson, S.E., Walker, S., Beaulieu, M.: Experimentation as a way of life: Okapi at TREC. Information Processing & Management, 36(1), 2000, pp. 95-108.
[15] Towell, G., Voorhees, E.M., Narendra, K.G., Johnson-Laird, B.: Learning collection fusion strategies for information retrieval. Proceedings of the Twelfth Annual Machine Learning Conference, 1995, pp. 540-548.
[16] Xu, J., Callan, J.P.: Effective retrieval with distributed collections. Proceedings of ACM-SIGIR'1998, pp. 112-120.
[17] Xu, J., Croft, W.B.: Cluster-based language models for distributed retrieval. Proceedings of ACM-SIGIR 1999, pp. 254-261.
[18] Yu, C., Meng, W., Wu, W., Liu, K.-L.: Efficient and effective metasearch for text databases incorporating linkages among documents. Proceedings of ACM-SIGMOD'2001, pp. 187-198.
[19] Zobel, J.: Collection selection via lexicon inspection. Proceedings of the Second Australian Document Computing Symposium, 1997, pp. 74-80.

Appendix 1: Additional evaluations

Single baseline: average precision 19.60.

n    n_first   Average precision   Difference with Single   Mean number of selected collections
1    7         19.34               -1.33%                   7.00
2    12        19.58               -0.10%                   7.22
3    14        19.46               -0.71%                   6.86
4    15        19.61               +0.05%                   6.62
5    21        19.47               -0.66%                   7.26

Table A: The best average precision obtained with our selection procedure

Appendix 2: Search model equation

The Okapi probabilistic model [14] calculates the weight of the term t within a document d as follows:

$$ w_{td} = \frac{(k_1 + 1) \cdot tf_{td}}{K + tf_{td}} $$

where

$$ K = k \cdot \left((1 - b) + b \cdot \frac{l_d}{advl}\right) $$

- l_d is the document length,
- advl is the average document length (set to 750),
- b is a constant (set to 0.9),
- k is a constant (set to 2),
- k_1 is a constant (set to 1.2),
- tf_td is the occurrence frequency of the term t in document d.

The following formula shows the weight assigned to a search keyword t within the query q:

$$ w_{tq} = \frac{tf_{tq}}{k_3 + tf_{tq}} \cdot \log\left(\frac{n - df_t}{df_t}\right) $$

where
- tf_tq is the search term frequency in the query,
- df_t is the number of documents in the collection containing the term t,
- n is the number of documents included in the collection,
- k_3 is a constant (set to 1000).
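As a worked illustration (not part of the paper), the Appendix 2 weighting scheme can be written directly in Python; the constants are those quoted above, and the retrieval status value is assumed here to be the usual inner product of document and query term weights.

```python
import math

ADVL, B, K_SMALL, K1, K3 = 750, 0.9, 2, 1.2, 1000  # constants from Appendix 2

def okapi_doc_weight(tf_td, doc_len):
    """Okapi weight w_td of a term within a document."""
    K = K_SMALL * ((1 - B) + B * doc_len / ADVL)
    return ((K1 + 1) * tf_td) / (K + tf_td)

def okapi_query_weight(tf_tq, df_t, n_docs):
    """Okapi weight w_tq of a term within the query (assumes 0 < df_t < n_docs)."""
    return (tf_tq / (K3 + tf_tq)) * math.log((n_docs - df_t) / df_t)

def retrieval_status_value(doc_tf, doc_len, query_tf, df, n_docs):
    """Assumed scoring rule: sum over query terms of w_td * w_tq.

    doc_tf / query_tf map terms to frequencies; df maps terms to document
    frequencies in the collection of n_docs documents.
    """
    return sum(okapi_doc_weight(doc_tf.get(t, 0), doc_len) *
               okapi_query_weight(tf, df[t], n_docs)
               for t, tf in query_tf.items() if df.get(t))
```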