A Methodology for Collection Selection in Heterogeneous Contexts

Faïza Abbaci, Ecole des Mines de Saint-Etienne, 158 Cours Fauriel, 42023 Saint-Etienne, France, abbaci@emse.fr
Jacques Savoy, Université de Neuchâtel, Pierre-à-Mazel 7, 2000 Neuchâtel, Switzerland, Jacques.Savoy@unine.ch
Michel Beigbeder, Ecole des Mines de Saint-Etienne, 158 Cours Fauriel, 42023 Saint-Etienne, France, mbeig@emse.fr

Abstract

In this paper we demonstrate that in an ideal Distributed Information Retrieval environment, taking into account the ability of each collection server to return relevant documents can be effective when selecting collections. Based on this assumption, we suggest a new approach to resolving the collection selection problem. In order to predict a collection's ability to return relevant documents, we inspect a limited number n of documents retrieved from each collection and analyze the proximity of search keywords within them. In our experiments, we vary the underlying parameter n of our suggested model to define the most appropriate number of top documents to be inspected. Moreover, we evaluate the retrieval effectiveness of our approach and compare it with both the centralized indexing and the CORI approaches [1], [16]. Preliminary results from these experiments, conducted on the WT10g test collection, tend to demonstrate that our suggested method can achieve appreciable retrieval effectiveness.

Keywords: Information retrieval, distributed information retrieval, collection selection, results merging strategy, evaluation.

1: Introduction

Distributing storage and search processing seem to be appropriate solutions for overcoming several limitations inherent in Centralized Information Retrieval (CIR) systems [7], particularly those due to the exponential growth of information available on the Internet. Distributed Information Retrieval (DIR) system architecture in its simplest form consists of various collection servers and a broker.
Typically, the broker receives the user's query and forwards it to a carefully selected subset of collection servers most likely to contain relevant documents for this query (e.g., based on search keywords, collection statistics, query language or a subset of servers pre-selected by the user). Finally, the broker combines the individual result lists (submitted by each selected collection) in order to produce a single ranked list. While most previous studies found that carrying out collection selection within DIR systems decreases effectiveness [16], some recent studies tend to demonstrate that when a good selection is provided, retrieval effectiveness in DIR systems has the potential of being just as good as in single CIR systems [12], [17]. In this paper, we investigate how to select the subset of collection servers most likely to be relevant to a given request, and thus obtain improved retrieval performance. In this vein, there are three main differences between our collection selection approach and various other selection methods. Firstly, our approach does not use any pre-stored metadata to predict a collection's relevance to the query. Secondly, it does not require collection ranking: each collection is selected independently of the others. Thirdly, our approach does not require any collection or server cooperation. The rest of this paper is organized as follows. The next section describes previous work attempting to resolve the collection selection problem. Section 3 describes the testbed used in our experiments. Section 4 demonstrates that a collection's ability to return relevant documents can serve as a good criterion for defining a selection procedure. Our suggested approach is presented in Section 5. Section 6 discusses in detail the evaluations we carried out on the WT10g test collection, and compares our strategy's performance with that of other selection schemes.
2: Collection selection

Collection selection can be performed automatically or manually; for a given query, it consists of selecting the collections likely to contain relevant documents. Obviously, ignoring this step and sending the submitted request to all known collections is one possible solution (an approach usually chosen by novice users), and we will denote this approach as NS (for No-Selection). This method is however expensive in terms of resources, and it also increases user latency. When a proper collection selection is made, it is possible to achieve results superior to those of the NS approach. Thus, the goal of collection
selection is to reduce the number of collections searched as much as possible, without decreasing retrieval effectiveness [5], [7]. In automatically selecting a subset of servers, most collection selection techniques compute a score for each collection, based on its usefulness for the query submitted. The collections are then ranked according to these scores, and the system might select either the N highest-ranking collections or those collections whose score exceeds a given threshold. In order to calculate these collection scores, however, the broker must have access to some information about a collection's contents, and the various suggested approaches differ, be it in the nature of this global information or in the manner in which it is acquired. Previous works have included collection descriptions [2] or collection statistics (frequency, co-occurrence, etc.) [6], [10], [19] in their collection selection methods, techniques that require a certain degree of collection cooperation. Xu & Croft [17] suggested selecting documents according to their topics, using a language model associated with each of these topics. Callan et al. [1] and Xu & Callan [16] described a Collection Retrieval Inference network (CORI), in which each collection is considered as a single gigantic document. These collection documents are then ranked using methods similar to those employed in conventional information retrieval systems. Larkey et al. [8] found that CORI works well when collections are organized topically. Several methods were developed by Zobel [19] to calculate collection scores, while Moffat & Zobel [10] suggested decomposing each collection into blocks of documents, with the blocks indexed by the broker. The resulting index was then used to find blocks having high-ranking scores with respect to the query, and collections were then selected to match these blocks.
The GlOSS system [6] ranked collections according to their appropriateness to the query submitted, estimating the number of documents in each collection whose query similarity was greater than a predefined threshold, and creating a collection score by summing these similarities. For a given query, Hawking & Thistlewaite [7] proposed broadcasting a probe query (containing one to three terms) to all available collections, each of which responds with term information that is then used to calculate a collection score. Towell et al. [15] developed a learning method in which they determined the optimum number of documents to be retrieved from each collection, rather than defining the number of collections to be searched. In their calculation of collection scores, Craswell et al. [4] included the search engine's retrieval effectiveness for each collection. Ogilvie & Callan [11] tested the use of query expansion to resolve the collection selection problem; however, their results were not conclusive. Yu et al. [18] ranked collections by incorporating information on linkages between documents in a Web environment. We believe CORI to be a suitable representative of the above strategies, and will describe its selection procedure in more detail. We will then evaluate it by comparing it with the approach used in our experiments. The CORI approach uses an inference network to rank collections.
For the ith collection and for a given query q, the collection score is computed as:

$$s_i = \frac{1}{m} \sum_{j=1}^{m} s(t_j \mid C_i)$$

where $s(t_j \mid C_i)$ indicates the contribution of the search term $t_j$ to the score of collection $C_i$, calculated as follows:

$$s(t_j \mid C_i) = defb + (1 - defb) \cdot \frac{df_{ij}}{df_{ij} + K} \cdot \frac{\log\left(\frac{|C| + 0.5}{cf_j}\right)}{\log\left(|C| + 1.0\right)}$$

where

$$K = k \cdot \left( (1 - b) + b \cdot \frac{lc_i}{avlc} \right)$$

and:
- m is the number of terms included in query q,
- |C| is the number of collections,
- df_ij is the number of documents in collection C_i containing the jth query term,
- cf_j is the number of collections containing the query term t_j,
- lc_i is the number of indexing terms in C_i,
- avlc is the average number of indexing terms per collection, and
- defb, b and k are constants which, as suggested by Xu & Callan [16], were set to the following values: defb = 0.4, k = 200 and b = 0.75.

After ranking the collections according to their scores, one possibility is to select the N top-ranked collections, where N can be determined by the user. Another possibility is to use an algorithm to cluster the collection scores and then select those collections in the top clusters [1]. We evaluated the latter case using a cluster difference threshold α = 0.0002.

3: Our testbed

Our experiments were conducted with the WT10g test collection, containing 1,692,096 Web pages from sites around the world (total size = 11,033 MB), which was used for the Web track of the TREC-9 conference. The queries used in our experiments were built from topic titles only, and thus correspond to real-life requests sent by users to the Excite search engine. They cover a variety of topics (e.g., "Parkinson's disease", "hunger", "Baltimore", "how e-mail benefits businesses" or "Mexican food culture"), and query lengths were rather short (a mean of 2.4 words and a standard deviation of 0.6). In order to simulate distributed collections, we divided the testbed into eight collections, each having roughly the same number of documents and the same size. Table 1
depicts various statistics about these collections, including size, number of documents, and number of queries having at least one relevant item. For some requests no relevant documents were returned by the collections; for example, Queries #464 and #487 did not return any relevant documents from any of the eight collections, due to spelling errors (Query #464 was written as "nativityscenes" and Query #487 as "angioplast7"). All collections were indexed by the SMART system, using the Okapi [14] probabilistic search model (see Appendix 2 for details).

Collection   Size (MB)   # documents   # queries with >= 1 rel. doc.
TREC9.1      1,325       207,485       28
TREC9.2      1,474       207,429       44
TREC9.3      1,438       221,916       38
TREC9.4      1,316       202,049       21
TREC9.5      1,309       203,073       22
TREC9.6      1,311       215,451       24
TREC9.7      1,336       200,146       25
TREC9.8      1,524       234,547       43

Table 1: Summary of our testbed statistics

4: Which is the best collection selection?

Some recent experiments have shown that DIR systems can outperform CIR systems in terms of average precision, if a good selection is provided [12], [17]. A good selection is one in which noise (1) and silence (2) are decreased. The question is: how do we make a good selection? To achieve this objective of decreasing noise and silence, we might choose collections containing at least one relevant document. Thus, we would decrease noise by eliminating collections without any relevant items, and decrease silence by maximizing the number of relevant documents that the system may retrieve. We will denote such a selection approach as Optimal1. On the other hand, there may be collections that contain relevant documents but use an ineffective retrieval strategy, placing relevant documents at the end of the returned list, where they have little chance of being consulted by the user. We believe that such collections should be eliminated.
Thus, a good selection strategy would select collections that contain relevant items, are able to return them, and present them among the n top documents. We will denote such a selection approach as Optimal2(n). Before presenting our selection approach, we want to verify the following hypotheses:

1. Optimal1 produces better results than both the NS and centralized (labeled Single in our tables) approaches.
2. Optimal2(n) results in better retrieval performance than both Optimal1 and the centralized approach.

In order to verify the second assumption, we need some idea of the best value of the underlying parameter n. In order to verify these assumptions, and in order to process the Optimal1 and Optimal2(n) selection procedures, we used our test collections and our knowledge about their relevant items. It should be noted that the centralized approach retrieves documents from a single database (labeled Single in our tables), and that the NS approach selects all collections. The Optimal1 selection procedure selects all collections containing at least one relevant document, and the Optimal2(n) procedure selects all collections containing at least one relevant document in the n top documents. To produce a single result list from the various result lists provided by the collection servers, we adopted the raw-score merging procedure, which merges individual result lists based on document scores [1], [5]. Table 2 shows the average precision achieved by the Single, NS and Optimal1 selection procedures. Shown in parentheses in Tables 2 and 3 are percent changes in average precision compared to the Single approach. As illustrated, Optimal1 provides better retrieval performance than either the Single approach (+7.7%) or the NS model (+9.2%, when comparing average precision: 21.11 vs. 19.32). Our first assumption is therefore confirmed.

(1) Set of irrelevant documents returned by the search engine.
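The raw-score merging procedure mentioned above can be sketched as follows (a minimal illustration, assuming each selected collection server returns (document_id, score) pairs with scores comparable across collections; all names are ours):

```python
def raw_score_merge(result_lists, k=None):
    """Merge per-collection result lists into a single ranked list.

    result_lists: list of result lists, one per selected collection,
    each a list of (doc_id, score) pairs. The raw-score assumption is
    that document scores are directly comparable across collections.
    """
    merged = [pair for results in result_lists for pair in results]
    merged.sort(key=lambda pair: pair[1], reverse=True)  # best score first
    return merged[:k] if k is not None else merged

# Example: two collections, three retrieved documents each
lists = [
    [("c1-d1", 2.7), ("c1-d2", 1.1), ("c1-d3", 0.9)],
    [("c2-d1", 3.0), ("c2-d2", 0.5), ("c2-d3", 0.2)],
]
top = raw_score_merge(lists, k=3)
# top == [("c2-d1", 3.0), ("c1-d1", 2.7), ("c1-d2", 1.1)]
```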
Selection approach   Average precision (48 queries)
Single               19.60%
NS                   19.32% (-1.4%)
Optimal1             21.11% (+7.7%)

Table 2: Average precision for three selection procedures

To verify our second assumption, we varied the value of n from 1 to 5, as shown in Table 3. For reasons of economy, we did not test Optimal2(n) for n greater than 5. Note that for some topics no collections were selected; we therefore only report results for queries for which at least one collection was selected. Table 3 shows that for the WT10g corpus and for n > 1, Optimal2(n) manifests better retrieval performance than Optimal1, and is therefore better than the Single approach. As such, our second assumption is also confirmed. Now that our two assumptions have been established, it can be deduced that selecting collections containing at least one relevant document is a good selection strategy. An even better approach is selecting all collections able to return at least one relevant document among the top n documents. Based on these conclusions, the following section proposes our collection selection procedure.

(2) Set of relevant documents not returned by the retrieval system.
n   # queries   Single   Optimal1        Optimal2(n)
1   32          28.64    30.44 (+6.3%)   29.29 (+2.3%)
2   32          28.64    30.44 (+6.3%)   30.99 (+8.2%)
3   35          25.01    26.30 (+5.2%)   28.26 (+13.0%)
4   38          24.62    26.40 (+7.2%)   27.27 (+10.8%)
5   39          24.02    25.79 (+7.4%)   27.17 (+13.1%)

Table 3: Average precision for the centralized, Optimal1 and Optimal2(n) selection approaches

5: Our selection procedure

The previous experiments showed that selecting collections able to return at least one relevant document among the top n retrieved documents is a good approach. The question must then be asked: how can we know whether a collection is able to return relevant documents? In practice, to achieve this objective we estimate the relevance of the first n documents returned by each collection and then select those collections having returned at least one document estimated to be relevant. This selection procedure will be denoted TRD-CS (using Top Ranked Documents for Collection Selection), an approach differing from its predecessors in that it does not assign scores to each collection. It bears a slight resemblance to the approach developed by Hawking & Thistlewaite [7], in that both assume that no information is available a priori to perform the selection, the required information being obtained during query processing. However, Hawking & Thistlewaite's [7] proposal is unrealistic in the sense that it relies on the widespread adoption of protocols allowing servers to communicate statistics or metadata about their holdings. On the other hand, our method is related to that of Craswell et al. [4] because it takes server effectiveness into account. The main guidelines of our approach can be summarized as follows: a broker broadcasts a query to all available servers (denoted C), each of which returns its n top-ranked documents to the broker. The broker then calculates a score for each received document (i.e., for n·|C| items), and sorts these documents according to their scores. Finally, the collections matching the n_first best documents are selected.
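The broker-side procedure just outlined can be sketched as follows (an illustrative sketch only; `score` stands in for the document scoring function defined below, and all names are ours):

```python
def trd_cs_select(servers, query, n=3, n_first=10, score=None):
    """TRD-CS sketch: select the collections that placed a document
    among the n_first best of all retrieved documents.

    servers: dict mapping a collection id to a search function that
    returns the n top-ranked documents for the query.
    score: function (document, query) -> float, e.g. the proximity-based
    document score of Section 5.
    """
    pool = []  # (score, collection_id) for all n * |C| received documents
    for cid, search in servers.items():
        for doc in search(query, n):          # broadcast: n docs per server
            pool.append((score(doc, query), cid))
    pool.sort(key=lambda pair: pair[0], reverse=True)
    # keep the collections owning one of the n_first best documents
    return {cid for _, cid in pool[:n_first]}
```

With eight collections and n = 3, for instance, the broker scores 24 documents and selects the collections owning the n_first best of them.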
When defining a document's score, we assume that the following hints can be used as good relevance indicators: the number of search keywords included in each document surrogate, the distance between two query terms in the document and, finally, the occurrence frequencies of the query terms in the document. From this perspective, and inspired by Lawrence & Lee Giles [9], we calculate document scores as follows:

$$score(d, q) = c_1 \cdot nb_d + c_2 \cdot dis\_ind(d, q) + c_3 \cdot nb\_occ$$

where, for each document d:
- nb_d is the number of search keywords included in the document d,
- nb_occ is the total number of occurrences of query terms in d,
- dis_ind(d, q) is the indicator of distance between two query terms in d; this function returns a real value greater than or equal to zero,
- c_1, c_2, c_3 are constants, set to c_1 = 100, c_2 = 1000, c_3 = 1000 in our experiments.

Following the formula introduced by Clarke et al. [3], and assuming that the first two query terms are the most important search keywords, we compute dis_ind only for these two terms, as follows:

$$dis\_ind(d, q) = \sum_i dis(k, l)_i$$

where:
- k and l are the positions within document d of the search keywords delimiting the ith block,
- dis(k, l)_i is the score of this block in document d; a block satisfies the query q (i.e., in our case, it contains the first two query terms) and does not include any smaller block satisfying the query (only the block having the smallest size is selected).

For example, consider a query consisting of two terms t_i and t_j. If t_i appears at the 5th and 25th positions and t_j at the 27th position, we find a first block (k = 5, l = 27) and a second block (k = 25, l = 27). As the first block contains the second, the first block is ignored and dis_ind is therefore reduced to dis_ind(d, q) = dis(25, 27) = 0.5.
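As an illustrative sketch (all names are ours, and word positions are assumed to be given), the proximity indicator and the document score above might be computed as follows. For two query terms, the minimal blocks are exactly the consecutive pairs of occurrences of different terms in the merged position list:

```python
def dis(k, l):
    """Score of one minimal block [k, l] (word positions):
    1/(l-k) when the terms are more than one position apart, else 1."""
    return 1.0 / (l - k) if (l - k) > 1 else 1.0

def dis_ind(positions_a, positions_b):
    """Sum of dis(k, l) over minimal blocks containing both terms.

    positions_a, positions_b: sorted word positions of the first two
    query terms in the document.
    """
    merged = sorted([(p, "a") for p in positions_a] +
                    [(p, "b") for p in positions_b])
    total = 0.0
    for (p1, t1), (p2, t2) in zip(merged, merged[1:]):
        if t1 != t2:                 # consecutive different terms: minimal block
            total += dis(p1, p2)
    return total

def score(nb_d, nb_occ, prox, c1=100, c2=1000, c3=1000):
    """Document score combining keyword count, proximity and occurrences."""
    return c1 * nb_d + c2 * prox + c3 * nb_occ

# The paper's example: t_i at positions 5 and 25, t_j at position 27
print(dis_ind([5, 25], [27]))  # 0.5
```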
More formally, dis(k, l)_i is defined as:

$$dis(k, l)_i = \begin{cases} \frac{1}{l - k} & \text{if } (l - k) > 1 \\ 1 & \text{if } (l - k) \leq 1 \end{cases}$$

In the case of a mono-term query, and following [9], dis_ind represents the inverse of the distance from the start of the document to the first occurrence of this unique search keyword. Finally, a document obtains a zero score if it does not contain any query term.

6: Evaluation

In our evaluations, and for comparative purposes, we refer to the centralized approach as a baseline. As a merging procedure, we used the LMS merging strategy (using result Length to define the Merging Score) [13], in which the length of the result lists is used to merge the retrieved items, because it yields the most effective retrieval when used with various selection approaches [13]. Table 4 depicts the average precision achieved by the Single, CORI and NS approaches as well as by our own selection procedure. The second column lists the average precision achieved by the various approaches, and the following columns the
precision after retrieving 5, 10, 15, 20, 30 and 100 documents.

Approach                   Avg.    Precision after:                             Mean # selected
                           prec.   5      10     15     20     30     100       collections
Single                     19.60   26.67  22.29  19.17  17.40  15.97  10.98     -
CORI                       19.22   27.92  22.50  19.58  17.71  15.90  10.65     7.52
NS                         19.32   27.92  22.08  19.44  17.60  15.76  10.75     8
TRD-CS n = 1 (n_first = 6)  18.93  28.33  22.29  20.56  18.85  16.53  10.10     6.00
TRD-CS n = 2 (n_first = 9)  19.00  28.33  23.13  20.69  18.75  16.04  10.12     6.02
TRD-CS n = 3 (n_first = 9)  19.00  28.33  23.54  20.56  18.96  16.32  9.90      5.34
TRD-CS n = 4 (n_first = 12) 19.20  28.33  23.33  19.86  18.44  15.97  10.02     5.82
TRD-CS n = 5 (n_first = 13) 18.84  27.92  22.71  19.72  18.33  15.83  10.19     5.96

Table 4: Comparison of average precision achieved by various selection approaches

We believe these performance measures are useful because the typical user will inspect only the top retrieved items. Finally, it is important for us to know the mean number of selected collections, and these values are depicted in the last column. In our testbed, an average of 4.9 collections out of a maximum of 8 should be selected for each query. Therefore a value near 4.9 in the last column can be viewed as good selection performance. The results depicted in Table 4 were obtained by varying the parameter n (the number of items from each collection to be inspected) as well as the parameter n_first (the number of documents used as a basis for selecting the underlying collections). The best retrieval performance for our selection approach is given in Appendix 1. Table 4 uses the following typographical conventions to present the results of our evaluation: results in bold are significantly better (a difference of 5% in average performance compared to the Single approach is considered significant); results in italic represent performance significantly inferior to the Single model; and regular type denotes no significant difference in retrieval effectiveness.
From the data shown in Table 4, we can draw the following conclusions. Reviewing the average precision shown in the second column, none of the retrieval models proves to have significantly better or worse performance than the Single approach. As for the precision achieved after retrieving 5, 15 or 20 documents, our selection approach (whatever the value of the parameters n or n_first) usually results in significantly better retrieval effectiveness than the Single approach. In such cases our selection approach also shows better retrieval performance than the CORI or NS models. When reviewing the precision achieved after retrieving 100 documents, however, the results for our selection model degrade slightly; yet Web users will typically not inspect result lists beyond the first 20 retrieved items. Overall, when comparing the results obtained by our selection scheme for different values of n, it seems that the best value for this parameter is around n = 3. For our second parameter, the best choice seems to be around n_first = 10 (see also Appendix 1 for the best parameter values obtained when using only average precision). When the value of the parameter n is increased beyond 3, Table 4 indicates a decrease in retrieval performance for our model.

7: Conclusion

In this paper we discussed responses to the following question: what should be considered a good selection in an ideal distributed environment, where knowledge of the entire set of relevant documents is available? To respond to this question we introduced two methods, denoted Optimal1 and Optimal2(n). Through experiments conducted on the WT10g test collection, we demonstrated that these two selection approaches provide better retrieval effectiveness than the centralized approach. Moreover, the Optimal2(n) selection procedure represents the better choice because it takes into account the ability of the collection servers to return relevant items.
We then introduced our own selection method; based on the previous results, we confirmed that good collection selection can result when we select those collections able to return at least one relevant document at the top of their response lists. From a practical point of view, our selection strategy relies on inspecting the top-ranked documents returned by each collection in order to judge the usefulness of a collection server. Our approach does not require the creation of any pre-stored metadata, and as such it does not need any updates to reflect changes in a collection's content. Also, our selection scheme will eliminate those collections that contain relevant documents but are unable to place them among the top retrieved documents. Our experiments were conducted using very short queries, similar to those submitted to
search engines, and may therefore be considered Web-realistic. However, our selection strategy does require more transfer traffic in order to download the first n documents per collection, and thus response time may increase slightly. The optimal value of n seems however to be relatively small (around 3), meaning that our selection approach does not cause very large downloading delays. Our evaluations also show that our selection procedure returns a reasonable number of collections, with a mean of 72.8% of collections being selected, compared to 94% for the CORI approach. The investigation described in this paper used the same search engine on each collection server, a context more closely reflecting that of a digital library environment, in which all resources are managed by the same search engine. Our current work will also consider the use of different collections, indexed and searched by various search engines.

Acknowledgements

The authors would like to thank C. Buckley from SabIR for giving us the opportunity to use the SMART system, without which this study could not have been conducted. We would also like to thank Yves Rasolofo from the University of Neuchâtel for providing us with the LMS merging program. This material is based on work supported in part by the Région Rhône-Alpes (Eurodoc grant, F. Abbaci) and by the SNSF (Swiss National Science Foundation, under grant #21-58 813.99, J. Savoy).

References

[1] Callan, J.P., Lu, Z., Croft, W.B.: Searching distributed collections with inference networks. Proceedings of ACM-SIGIR'1995, pp. 21-28.
[2] Chakravarthy, A.S., Haase, K.B.: NetSerf: Using semantic knowledge to find internet information archives. Proceedings of ACM-SIGIR'1995, pp. 4-11.
[3] Clarke, C.L.A., Cormack, G.V., Burkowski, F.J.: Shortest substring ranking (MultiText experiments for TREC-4). Proceedings of TREC-4, 1995, pp. 295-304.
[4] Craswell, N., Bailey, P., Hawking, D.: Server selection in the world wide web. Proceedings of ACM-DL'2000, pp. 37-46.
[5] French, J.C., Powell, A.L., Callan, J., Viles, C.L., Emmitt, T., Prey, K.J., Mou, Y.: Comparing the performance of database selection algorithms. Proceedings of ACM-SIGIR'1999, pp. 238-245.
[6] Gravano, L., Garcia-Molina, H., Tomasic, A.: GlOSS: Text-source discovery over the Internet. ACM Transactions on Database Systems, 24(2), 1999, pp. 229-264.
[7] Hawking, D., Thistlewaite, P.: Methods for information server selection. ACM Transactions on Information Systems, 17(1), 1999, pp. 40-76.
[8] Larkey, L.S., Connell, M., Callan, J.: Collection selection and results merging with topically organized U.S. patents and TREC data. Proceedings of ACM-CIKM'2000, pp. 282-289.
[9] Lawrence, S., Lee Giles, C.: Inquirus, the NECI meta search engine. Proceedings of WWW'7, 1998, pp. 95-105.
[10] Moffat, A., Zobel, J.: Information retrieval systems for large document collections. Proceedings of TREC-3, 1995, pp. 85-94.
[11] Ogilvie, P., Callan, J.: The effectiveness of query expansion for distributed information retrieval. Proceedings of ACM-CIKM'2001, to appear.
[12] Powell, A.L., French, J.C., Callan, J., Connell, M., Viles, C.L.: The impact of database selection on distributed searching. Proceedings of ACM-SIGIR'2000, pp. 232-239.
[13] Rasolofo, Y., Abbaci, F., Savoy, J.: Approaches to collection selection and results merging for distributed information retrieval. Proceedings of ACM-CIKM'2001, to appear.
[14] Robertson, S.E., Walker, S., Beaulieu, M.: Experimentation as a way of life: Okapi at TREC. Information Processing & Management, 36(1), 2000, pp. 95-108.
[15] Towell, G., Voorhees, E.M., Narendra, K.G., Johnson-Laird, B.: Learning collection fusion strategies for information retrieval. Proceedings of the Twelfth Annual Machine Learning Conference, 1995, pp. 540-548.
[16] Xu, J., Callan, J.P.: Effective retrieval with distributed collections. Proceedings of ACM-SIGIR'1998, pp. 112-120.
[17] Xu, J., Croft, W.B.: Cluster-based language models for distributed retrieval.
Proceedings of ACM-SIGIR'1999, pp. 254-261.
[18] Yu, C., Meng, W., Wu, W., Liu, K.-L.: Efficient and effective metasearch for text databases incorporating linkages among documents. Proceedings of ACM-SIGMOD'2001, pp. 187-198.
[19] Zobel, J.: Collection selection via lexicon inspection. Proceedings of the Second Australian Document Computing Symposium, 1997, pp. 74-80.

Appendix 1: Additional evaluations

n        n_first   Average precision   Difference with Single   Mean # selected collections
Single   -         19.60               -                        -
1        7         19.34               -1.33%                   7.00
2        12        19.58               -0.10%                   7.22
3        14        19.46               -0.71%                   6.86
4        15        19.61               +0.05%                   6.62
5        21        19.47               -0.66%                   7.26

Table A: The best average precision obtained with our selection procedure

Appendix 2: Search model equations

The Okapi probabilistic model [14] calculates the weight of the term t within a document d as follows:

$$w_{td} = \frac{(k_1 + 1) \cdot tf_{td}}{K + tf_{td}}$$
where

$$K = k \cdot \left( (1 - b) + b \cdot \frac{l_d}{advl} \right)$$

- l_d is the document length,
- advl is the average document length (set to 750),
- b is a constant (set to 0.9),
- k is a constant (set to 2),
- k_1 is a constant (set to 1.2),
- tf_td is the occurrence frequency of the term t in document d.

The following formula shows the weight assigned to a search keyword t within the query q:

$$w_{tq} = \frac{tf_{tq}}{k_3 + tf_{tq}} \cdot \log\left(\frac{n - df_t}{df_t}\right)$$

where
- tf_tq is the search term frequency in the query,
- df_t is the number of documents in the collection containing the term t,
- n is the number of documents included in the collection,
- k_3 is a constant (set to 1000).
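A minimal sketch of these two weighting formulas (our own illustration, using the constant values listed above; function names are ours):

```python
import math

def okapi_doc_weight(tf_td, l_d, advl=750.0, b=0.9, k=2.0, k1=1.2):
    """Okapi weight of term t in document d (Appendix 2, first formula)."""
    K = k * ((1 - b) + b * l_d / advl)
    return ((k1 + 1) * tf_td) / (K + tf_td)

def okapi_query_weight(tf_tq, df_t, n, k3=1000.0):
    """Okapi weight of term t in query q (Appendix 2, second formula)."""
    return (tf_tq / (k3 + tf_tq)) * math.log((n - df_t) / df_t)

# A document's retrieval status value is then obtained by summing,
# over the query terms, the product of the two weights.
```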