A Methodology for Collection Selection in Heterogeneous Contexts

Faïza Abbaci, Ecole des Mines de Saint-Etienne, 158 Cours Fauriel, Saint-Etienne, France
Jacques Savoy, Université de Neuchâtel, Pierre-à-Mazel 7, 2000 Neuchâtel, Suisse
Michel Beigbeder, Ecole des Mines de Saint-Etienne, 158 Cours Fauriel, Saint-Etienne, France

Abstract

In this paper we demonstrate that, in an ideal Distributed Information Retrieval environment, taking into account the ability of each collection server to return relevant documents can make collection selection effective. Based on this assumption, we suggest a new approach to the collection selection problem. In order to predict a collection's ability to return relevant documents, we inspect a limited number n of documents retrieved from each collection and analyze the proximity of the search keywords within them. In our experiments, we vary the underlying parameter n of our suggested model to define the most appropriate number of top documents to be inspected. Moreover, we evaluate the retrieval effectiveness of our approach and compare it with both the centralized indexing and the CORI approaches [1], [16]. Preliminary results from these experiments, conducted on the WT10g test collection, tend to demonstrate that our suggested method can achieve appreciable retrieval effectiveness.

Keywords: Information retrieval, distributed information retrieval, collection selection, results merging strategy, evaluation.

1: Introduction

Distributing storage and search processing seems to be an appropriate solution for overcoming several limitations inherent in Centralized Information Retrieval (CIR) systems [7], particularly those due to the exponential growth of information available on the Internet. A Distributed Information Retrieval (DIR) system architecture in its simplest form consists of various collection servers and a broker. Typically, the broker receives the user's query and forwards it to a carefully selected subset of collection servers most likely to contain relevant documents for this query (e.g., based on search keywords, collection statistics, query language, or a subset of servers pre-selected by the user). Finally, the broker combines the individual result lists (returned by each selected collection) in order to produce a single ranked list. While most previous studies found that carrying out collection selection within DIR systems decreases effectiveness [16], some recent studies tend to demonstrate that when a good selection is provided, retrieval effectiveness in DIR systems has the potential of being just as good as that of single CIR systems [12], [17].

In this paper, we investigate how to select the subset of collection servers most likely to be relevant to a given request, and thus obtain improved retrieval performance. In this vein, there are three main differences between our collection selection approach and various other selection methods. Firstly, our approach does not use any pre-stored metadata to predict a collection's relevance to the query. Secondly, it does not require collection ranking: each collection is selected independently from the others. Thirdly, our approach does not require any collection or server cooperation.

The rest of this paper is organized as follows. The next section describes previous work that attempts to resolve the collection selection problem. Section 3 illustrates the testbed we used in our experiments.
Section 4 demonstrates that a collection's ability to return relevant documents can serve as a good criterion for defining a selection procedure. Our suggested approach is presented in Section 5. Section 6 discusses in detail the evaluations we carried out on the WT10g test collection, and compares our strategy's performance with that of other selection schemes.

2: Collection selection

Collection selection can be performed automatically or manually, and for a given query it consists of selecting the collections likely to contain relevant documents. Obviously, ignoring this step and sending the submitted request to all known collections is one possible solution (an approach usually chosen by novice users); we will denote this approach as NS (for No-Selection). This method is however expensive in terms of resources, and it also increases user latency. When a proper collection selection is made, it is possible to achieve results superior to those of the NS approach. Thus, the goal of collection selection is to reduce the number of collections searched as much as possible, without decreasing retrieval effectiveness [5], [7].

In automatically selecting a subset of servers, most collection selection techniques compute a score for each collection, based on its usefulness for the submitted query. The collections are then ranked according to these scores, and the system may select either the N highest-ranking collections or those collections with a score exceeding a given threshold. In order to calculate these collection scores however, the broker must have access to some information about a collection's contents, and the various suggested approaches differ, be it in the nature of this global information or in the manner in which it is acquired. Previous works have used collection descriptions [2] or collection statistics (frequency, co-occurrence, etc.) [6], [10], [19] in their collection selection methods, techniques that require a certain degree of collection cooperation. Xu & Croft [17] suggested selecting documents according to their topics, together with a language model associated with each of these topics. Callan et al. [1] and Xu & Callan [16] described a Collection Retrieval Inference network (CORI), in which each collection is considered as a single gigantic document. These collection documents are then ranked using methods similar to those employed in conventional information retrieval systems. Larkey et al. [8] found that CORI works well when collections are organized topically. Several methods were developed by Zobel [19] to calculate collection scores, while Moffat & Zobel [10] suggested decomposing each collection into blocks of documents, where the blocks were indexed by the broker. The resulting index was then used to find blocks having high-ranking scores with respect to the query, and the collections matching these blocks were then selected. The GlOSS system [6] ranked collections according to their appropriateness to the submitted query, estimating the number of documents in each collection whose query similarity was greater than a predefined threshold, and computing a collection score by summing these similarities. For a given query, Hawking & Thistlewaite [7] proposed broadcasting a probe query (containing one to three terms) to all available collections; each collection responds with term information that is then used to calculate its collection score. Towell et al. [15] developed a learning method in which they determined the optimum number of documents to be retrieved from each collection, rather than defining the number of collections to be searched. In their calculation of collection scores, Craswell et al. [4] included the search engine's retrieval effectiveness for each collection. Ogilvie & Callan [11] tested the use of query expansion to resolve the collection selection problem, but their results were not conclusive. Yu et al. [18] ranked collections by incorporating information on linkages between documents in a Web environment.

We believe CORI to be a suitable representative of the above strategies, and will describe its selection procedure in more detail. We will then evaluate it by comparing it with the approach we used in our experiments. The CORI approach uses an inference network to rank collections.
For the ith collection C_i and a given query q, the collection score is computed as:

s_i = (1/m) · Σ_{j=1..m} s(t_j | C_i)

where s(t_j | C_i) indicates the contribution of the search term t_j to the score of collection C_i, calculated as follows:

s(t_j | C_i) = defb + (1 - defb) · (df_i / (df_i + K)) · (log(|C| / cf_j) / log(|C|))

with K = k · ((1 - b) + b · (lc_i / avlc))

where:
- m is the number of terms included in query q,
- |C| is the number of collections,
- df_i is the number of documents in collection C_i containing the jth query term,
- cf_j is the number of collections containing the query term t_j,
- lc_i is the number of indexing terms in C_i,
- avlc is the average number of indexing terms per collection, and
- defb, b and k are constants; as suggested by Xu & Callan [16], defb = 0.4 and k = 200.

After ranking the collections according to their scores, one possibility is to select the N top-ranked collections, where N can be determined by the user. Another possibility is to use an algorithm to cluster the collection scores and then select those collections in the top clusters [1]. We evaluated the latter case using a cluster difference threshold α.
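To make the CORI scoring concrete, the following Python sketch computes these collection scores from simple per-collection statistics. It is only an illustrative implementation of the formula above under the stated constants (defb = 0.4, k = 200); the per-collection dictionaries are assumptions made for the example, and the value b = 0.75 is a commonly used CORI default rather than a value given in this paper.

```python
import math

DEFB, K_CONST = 0.4, 200   # constants suggested by Xu & Callan [16]
B = 0.75                   # assumed value; the source does not state b explicitly

def cori_score(query_terms, collection, collections):
    """Score one collection C_i for query q with the CORI belief formula."""
    n_coll = len(collections)
    avlc = sum(c["lc"] for c in collections) / n_coll       # average lexicon size
    K = K_CONST * ((1 - B) + B * collection["lc"] / avlc)
    score = 0.0
    for t in query_terms:
        df_i = collection["df"].get(t, 0)                   # docs of C_i containing t
        cf_j = sum(1 for c in collections if t in c["df"])  # collections containing t
        if df_i == 0 or cf_j == 0:
            belief = DEFB
        else:
            belief = DEFB + (1 - DEFB) * (df_i / (df_i + K)) * \
                     (math.log(n_coll / cf_j) / math.log(n_coll))
        score += belief
    return score / len(query_terms)

# Toy usage: three collections described by lexicon size and document frequencies.
collections = [
    {"name": "TREC9.1", "lc": 500_000, "df": {"parkinson": 120, "disease": 9000}},
    {"name": "TREC9.2", "lc": 480_000, "df": {"disease": 4000}},
    {"name": "TREC9.3", "lc": 520_000, "df": {"parkinson": 10, "disease": 7000}},
]
for c in collections:
    print(c["name"], round(cori_score(["parkinson", "disease"], c, collections), 3))
```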

3: Our testbed

Our experiments were conducted with the WT10g test collection, containing 1,692,096 Web pages from sites around the world (total size = 11,033 MB), which was used for the Web track of the TREC9 conference. The queries used in our experiments were built from the topic titles only, and thus correspond to real-life requests sent by users to the Excite search engine. They cover a variety of topics (e.g., Parkinson's disease, hunger, Baltimore, how benefits businesses, or Mexican food culture), and query lengths were rather short (a mean of 2.4 words and a standard deviation of 0.6). In order to simulate distributed collections, we divided the testbed into eight collections, each having roughly the same number of documents and the same size. Table 1 depicts various statistics about these collections, including size, number of documents, and number of queries having at least one relevant item. For some requests no relevant documents were returned by the collections; for example, Queries #464 and #487 did not return any documents from any of the eight collections, due to spelling errors (Query #464 was written as "nativityscenes" and Query #487 as "angioplast7"). All collections were indexed by the SMART system, using the Okapi [14] probabilistic search model (see Appendix 2 for details).

Table 1: Summary of our testbed statistics, listing for each of the eight collections (TREC9.1 to TREC9.8) its size in MB, its number of documents, and the number of queries having at least one relevant item.

4: Which is the best collection selection?

Some recent experiments have shown that DIR systems can outperform CIR systems in terms of average precision, if a good selection is provided [12], [17]. A good selection can be described as one in which both noise (the set of irrelevant documents returned by the search engine) and silence (the set of relevant documents not returned by the retrieval system) have decreased. The question is: how do you make a good selection? To achieve this objective of decreasing noise and silence, we might choose the collections containing at least one relevant document. Thus, we would decrease noise by eliminating collections without any relevant items, and decrease silence by maximizing the number of relevant documents that the system may retrieve. We will denote such a selection approach as Optimal1. On the other hand, there may be collections that contain relevant documents but that use an ineffective retrieval strategy, placing relevant documents at the end of the returned list where they have little chance of being consulted by the user. We believe that such collections should be eliminated. Thus, a good selection strategy would select collections that contain relevant items and are able to present them among the n top documents. We will denote such a selection approach as Optimal2(n).

Before presenting our selection approach, we want to verify the following hypotheses:
1. Optimal1 produces better results than both the NS and the centralized (labeled Single in our tables) approaches.
2. Optimal2(n) results in better retrieval performance than both Optimal1 and the centralized approach.

In order to verify the second assumption, we need some idea of the best value of the underlying parameter n. In order to verify these assumptions, and in order to process the Optimal1 and Optimal2(n) selection procedures, we used our test collection and our knowledge about its relevant items. It should be noted that the centralized approach retrieves documents from a single database (labeled Single in our tables), and that the NS approach selects all collections. The Optimal1 selection procedure selects all collections containing at least one relevant document, and the Optimal2(n) procedure selects all collections returning at least one relevant document among their n top documents. To produce a single result list from the various result lists provided by the collection servers, we adopted the raw-score merging procedure, which merges the individual result lists based on document scores [1], [5].
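As an illustration of these two building blocks, the sketch below shows, under assumed data structures, how an Optimal2(n)-style oracle filter and raw-score merging could be written: a collection is kept only if a known-relevant document appears in its top n results, and the surviving result lists are merged by sorting on the raw document scores. The (doc_id, score) list format and the relevance-judgment set are assumptions made for the example.

```python
def optimal2_select(run_per_collection, relevant_ids, n):
    """Keep only collections that place at least one relevant document
    in their top-n results (an oracle filter, usable only on a testbed
    where relevance judgments are known)."""
    selected = {}
    for coll, ranked in run_per_collection.items():
        top_n = [doc_id for doc_id, _ in ranked[:n]]
        if any(doc_id in relevant_ids for doc_id in top_n):
            selected[coll] = ranked
    return selected

def raw_score_merge(run_per_collection):
    """Merge the individual result lists into one list ordered by the
    raw scores returned by each collection server."""
    merged = [(doc_id, score, coll)
              for coll, ranked in run_per_collection.items()
              for doc_id, score in ranked]
    return sorted(merged, key=lambda item: item[1], reverse=True)

# Toy usage with (doc_id, score) lists per collection and one known relevant doc.
runs = {
    "TREC9.1": [("d12", 7.1), ("d98", 6.4)],
    "TREC9.2": [("d55", 8.0), ("d31", 2.2)],
}
relevant = {"d98"}
kept = optimal2_select(runs, relevant, n=2)   # TREC9.2 is dropped
print(raw_score_merge(kept))                  # [('d12', 7.1, 'TREC9.1'), ('d98', 6.4, 'TREC9.1')]
```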
Table 2 shows the average precision achieved by the Single, NS and Optimal1 selection procedures. Shown in parentheses in Tables 2 and 3 are percent changes in average precision compared to the Single approach. As illustrated, Optimal1 provides better retrieval performance than either the Single approach (+7.7%) or the NS model (+9.2%). Our first assumption is therefore confirmed.

Selection approach    Average precision (48 queries)
Single                19.60%
NS                    19.32% (-1.4%)
Optimal1              (+7.7%)
Table 2: Average precision for three selection procedures

For our second assumption, we varied the value of n from 1 to 5, as shown in Table 3. For reasons of economy, we did not test Optimal2(n) for n greater than 5. Note that for some topics no collections were selected; we therefore only report results for queries for which at least one collection was selected. Table 3 shows that for the WT10g corpus and for n > 1, Optimal2(n) yields better retrieval performance than Optimal1, and is therefore also better than the Single approach. As such, our second assumption is also confirmed. Now that our two assumptions have been established, it can be deduced that selecting collections containing at least one relevant document is a good selection strategy. An even better approach is selecting all collections able to return at least one relevant document among their top n documents. Based on these conclusions, the following section will propose our collection selection procedure.

n    Optimal1    Optimal2(n)
1    (+6.3%)     (+2.3%)
2    (+6.3%)     (+8.2%)
3    (+5.2%)     (+13.0%)
4    (+7.2%)     (+10.8%)
5    (+7.4%)     (+13.1%)
Table 3: Average precision for the centralized (Single), Optimal1 and Optimal2(n) selection approaches (shown as percent change over the Single baseline)

5: Our selection procedure

The previous experiments showed that selecting collections able to return at least one relevant document among their top n retrieved documents is a good approach. The question must then be asked: how can we know whether or not a collection is able to return relevant documents? In practice, to achieve this objective we estimate the relevance of the first n documents returned by each collection and then select the collections having returned at least one presumably relevant document. This selection procedure will be denoted TRD-CS (using Top Ranked Documents for Collection Selection), an approach differing from its predecessors in that it does not assign a score to each collection. It bears a slight resemblance to the approach developed by Hawking & Thistlewaite [7], because both assume that no information is available a priori to perform the selection, the required information being obtained during query processing. However, Hawking & Thistlewaite's [7] proposition is unrealistic in the sense that it relies on the widespread adoption of protocols allowing servers to communicate statistics or metadata about their holdings. On the other hand, our method is linked with that of Craswell et al. [4] because it takes server effectiveness into account.

The main guidelines of our approach can be summarized as follows. The broker broadcasts the query to all available servers (denoted C), each of which returns its n top-ranked documents to the broker. The broker then calculates a score for each received document (i.e., for n·|C| items) and sorts them according to these scores. Finally, the collections matching the n_first best documents are selected.

When we define a document's score, we assume that the following hints can be used as good relevance indicators: the number of search keywords included in each document surrogate, the distance between two query terms in the document, and their occurrence frequencies in the document. From this perspective, and inspired by Lawrence & Lee Giles [9], we calculate document scores as follows:

score(d, q) = (c_1 · nb_d) + (c_2 · dis_ind(d, q)) + (c_3 · nb_occ)

where, for each document d:
- nb_d is the number of distinct search keywords included in the document d,
- nb_occ is the total number of occurrences of query terms in d,
- dis_ind(d, q) is the indicator of the distance between two query terms in d; this function returns a real value greater than or equal to zero,
- c_1, c_2, c_3 are constants, set to c_1 = 100, c_2 = 1000, c_3 = 1000 in our experiments.

Following the formula introduced by Clarke et al. [3], and assuming that the first two query terms are the most important search keywords, we compute dis_ind only for these two terms, as follows:

dis_ind(d, q) = Σ_i dis(k, l)_i

where:
- k and l are search keyword positions within the document d delimiting the ith block,
- dis(k, l)_i is the score of this block in the document d; a block satisfies the query q (i.e., the block contains the first two query terms in our case) and does not include any other smaller block satisfying the query (only the block having the smallest size is retained).

For example, consider a given query consisting of two terms t_i and t_j.
If t_i appears at the 5th and 25th positions and t_j at the 27th position, we can form a first block (k = 5, l = 27) and a second block (k = 25, l = 27). As the first block contains the second, the first block is ignored and dis_ind is therefore reduced to dis_ind(d, q) = dis(25, 27) = 0.5. More formally, dis(k, l)_i is defined as:

dis(k, l)_i = 1 / (l - k)   if (l - k) > 1
dis(k, l)_i = 1             if (l - k) ≤ 1

In the case of a mono-term query, and following [9], dis_ind represents the inverse of the distance from the start of the document to the first occurrence of this unique search keyword. Finally, a document obtains a zero score if it does not contain any query terms.
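The sketch below illustrates this document scoring and the TRD-CS selection step under the constants given above (c_1 = 100, c_2 = 1000, c_3 = 1000). It assumes the broker already holds, for each collection, the sorted query-term positions of its top n documents; the data structures and the minimal-block enumeration are illustrative assumptions, not the authors' implementation.

```python
C1, C2, C3 = 100, 1000, 1000   # constants used in the experiments

def dis_ind(positions_a, positions_b):
    """Proximity indicator in the spirit of Clarke et al. [3]: sum 1/(l - k)
    over the smallest blocks containing both terms (1 when l - k <= 1).
    With only one available term, return the inverse of its first position."""
    if not positions_a and not positions_b:
        return 0.0
    if not positions_a or not positions_b:            # mono-term case (or missing term)
        first = (positions_a or positions_b)[0]
        return 1.0 / max(first, 1)
    events = sorted([(p, "a") for p in positions_a] +
                    [(p, "b") for p in positions_b])
    total = 0.0
    for (k, ta), (l, tb) in zip(events, events[1:]):
        if ta != tb:                                  # minimal block [k, l]: no other
            gap = l - k                               # occurrence lies strictly inside it
            total += 1.0 if gap <= 1 else 1.0 / gap
    return total

def score(doc_positions, query_terms):
    """score(d, q) = c1 * nb_d + c2 * dis_ind(d, q) + c3 * nb_occ."""
    nb_d = sum(1 for t in query_terms if doc_positions.get(t))
    nb_occ = sum(len(doc_positions.get(t, [])) for t in query_terms)
    if nb_d == 0:
        return 0.0
    first, second = (query_terms + [None])[:2]        # only the first two query terms
    prox = dis_ind(doc_positions.get(first, []),
                   doc_positions.get(second, []) if second else [])
    return C1 * nb_d + C2 * prox + C3 * nb_occ

def trd_cs_select(top_docs, query_terms, n_first):
    """Score the n documents returned by every collection, sort them all,
    and select the collections owning the n_first best documents."""
    scored = [(score(pos, query_terms), coll)
              for coll, docs in top_docs.items() for pos in docs]
    scored.sort(reverse=True)
    return {coll for _, coll in scored[:n_first]}

# Toy usage: two collections, top-2 documents each, described by query-term positions.
top_docs = {
    "TREC9.1": [{"mexican": [3, 40], "food": [5]}, {"food": [12]}],
    "TREC9.2": [{"mexican": [200]}, {}],
}
print(trd_cs_select(top_docs, ["mexican", "food"], n_first=2))   # {'TREC9.1'}
```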

6: Evaluation

In our evaluations, and for comparative purposes, we use the centralized approach as a baseline. As a merging procedure, we used the LMS merging strategy (using result Length to define the Merging Score) [13], because it results in the most effective retrieval when used with various selection approaches [13]. Table 4 depicts the average precision achieved by the Single, CORI, NS and our own selection procedures. The second column lists the average precision achieved by the various approaches, and the following columns the precision after retrieving 5, 10, 15, 20, 30 and 100 documents. We believe these performance measures are useful because the typical user will inspect only the top retrieved items. Finally, it is also important to know the mean number of selected collections, and these values are depicted in the last column. In our testbed, an average of 4.9 collections out of a maximum of 8 should be selected for each query; a value near 4.9 in the last column can therefore be viewed as good selection performance.

Table 4: Comparison of average precision achieved by various selection approaches (Single, CORI, NS, and TRD-CS with n = 1 to 5 and n_first = 6, 9, 9, 12 and 13 respectively), reporting the average precision, the precision after 5, 10, 15, 20, 30 and 100 retrieved documents, and the mean number of selected collections.

The results depicted in Table 4 were obtained by varying the parameter n (the number of items from each collection to be inspected) as well as the parameter n_first (the number of top documents used as a basis for selecting the underlying collections). The best retrieval performance for our selection approach is given in Appendix 1. Table 4 uses the following typographical conventions to present the results of our evaluation. Results in bold are significantly better than the Single approach (a difference of 5% in average performance is considered significant); those in italics represent performance significantly inferior to the Single model; regular type denotes no significant difference in retrieval effectiveness.

From the data shown in Table 4, we can infer the following conclusions. In reviewing the average precision shown in the second column, none of the retrieval models proves to have significantly better or worse performance than the Single approach. As for the precision achieved after retrieving 5, 15 or 20 documents, our selection approach (whatever the value of the parameters n and n_first) usually results in significantly better retrieval effectiveness than the Single approach. In such cases our selection approach also shows better retrieval performance than the CORI or NS models. When reviewing the precision achieved after retrieving 100 documents however, the results for our selection model degrade slightly; but Web users will typically not inspect result lists beyond the first 20 retrieved items. Overall, when comparing the results obtained by our selection scheme for different values of n, it seems that the best value for this parameter is around n = 3. For our second parameter, the best choice seems to be around n_first = 10 (see also Appendix 1 for the best parameter values obtained when using only average precision). When the value of the parameter n is increased beyond 3, Table 4 indicates a decrease in retrieval performance.

7: Conclusion

In this paper we discussed the following question: what should be considered a good selection in an ideal distributed environment, where knowledge of the entire set of relevant documents is available? To answer this question we introduced two methods, denoted Optimal1 and Optimal2(n). Through experiments conducted on the WT10g test collection, we demonstrated that these two selection approaches provide better retrieval than the centralized approach. Moreover, the Optimal2(n) selection procedure represents the better choice because it takes into account the ability of the collection servers to return relevant items.
We then introduced our own selection method, and based on the previous results we confirmed that good collection selection can be achieved by selecting those collections able to return at least one relevant document at the top of their response lists. From a practical point of view, our selection strategy relies on inspecting the top-ranked documents returned by each collection in order to judge the usefulness of a collection server. Our approach does not require the creation of any pre-stored metadata, and as such it does not need any updates to reflect changes in collection content. Also, our selection scheme will eliminate those collections that contain relevant documents but are unable to place them among the top retrieved documents. Our experiments were conducted using very short queries, similar to those submitted to search engines, and may therefore be considered Web-realistic.

However, our selection strategy does require more transfer traffic in order to download the first n documents from each collection, and thus response time may increase slightly. The optimal value of n seems however to be relatively small (around 3), meaning our selection approach does not cause very large downloading delays. Our evaluations also show that our selection procedure returns a reasonable number of collections, with a mean of 72.8% of the collections being selected, compared to 94% for the CORI approach. The investigation described in this paper used the same search engine for each collection server, a context more closely reflecting that of a digital library environment, in which all resources are managed by the same search engine. Our current work will also consider the use of different collections, indexed and searched by various search engines.

Acknowledgements

The authors would like to thank C. Buckley from SabIR for giving us the opportunity to use the SMART system, without which this study could not have been conducted. We would also like to thank Yves Rasolofo from the University of Neuchâtel for providing us with the LMS merging program. This material is based on work supported in part by the Région Rhône-Alpes (Eurodoc grant for F. Abbaci) and by the SNSF (Swiss National Science Foundation grant, J. Savoy).

References

[1] Callan, J.P., Lu, Z., Croft, W.B.: Searching distributed collections with inference networks. Proceedings of ACM-SIGIR 1995.
[2] Chakravarthy, A.S., Haase, K.B.: NetSerf: Using semantic knowledge to find Internet information archives. Proceedings of ACM-SIGIR 1995.
[3] Clarke, C.L.A., Cormack, G.V., Burkowski, F.J.: Shortest substring ranking (MultiText experiments for TREC-4). Proceedings of TREC-4, 1995.
[4] Craswell, N., Bailey, P., Hawking, D.: Server selection in the world wide web. Proceedings of ACM-DL 2000.
[5] French, J.C., Powell, A.L., Callan, J., Viles, C.L., Emmitt, T., Prey, K.J., Mou, Y.: Comparing the performance of database selection algorithms. Proceedings of ACM-SIGIR 1999.
[6] Gravano, L., Garcia-Molina, H., Tomasic, A.: GlOSS: Text-source discovery over the Internet. ACM Transactions on Database Systems, 24(2), 1999.
[7] Hawking, D., Thistlewaite, P.: Methods for information server selection. ACM Transactions on Information Systems, 17(1), 1999.
[8] Larkey, L.S., Connell, M., Callan, J.: Collection selection and results merging with topically organized U.S. patents and TREC data. Proceedings of ACM-CIKM 2000.
[9] Lawrence, S., Lee Giles, C.: Inquirus, the NECI meta search engine. Proceedings of WWW7, 1998.
[10] Moffat, A., Zobel, J.: Information retrieval systems for large document collections. Proceedings of TREC-3, 1995.
[11] Ogilvie, P., Callan, J.: The effectiveness of query expansion for distributed information retrieval. Proceedings of ACM-CIKM 2001, to appear.
[12] Powell, A.L., French, J.C., Callan, J., Connell, M., Viles, C.L.: The impact of database selection on distributed searching. Proceedings of ACM-SIGIR 2000.
[13] Rasolofo, Y., Abbaci, F., Savoy, J.: Approaches to collection selection and results merging for distributed information retrieval. Proceedings of ACM-CIKM 2001, to appear.
[14] Robertson, S.E., Walker, S., Beaulieu, M.: Experimentation as a way of life: Okapi at TREC. Information Processing & Management, 36(1), 2000.
[15] Towell, G., Voorhees, E.M., Narendra, K.G., Johnson-Laird, B.: Learning collection fusion strategies for information retrieval. Proceedings of the Twelfth Annual Machine Learning Conference, 1995.
[16] Xu, J., Callan, J.P.: Effective retrieval with distributed collections. Proceedings of ACM-SIGIR 1998.
[17] Xu, J., Croft, W.B.: Cluster-based language models for distributed retrieval. Proceedings of ACM-SIGIR 1999.
[18] Yu, C., Meng, W., Wu, W., Liu, K.-L.: Efficient and effective metasearch for text databases incorporating linkages among documents. Proceedings of ACM-SIGMOD 2001.
[19] Zobel, J.: Collection selection via lexicon inspection. Proceedings of the Second Australian Document Computing Symposium, 1997.

Appendix 1: Additional evaluations

Table A: The best average precision obtained with our selection procedure, for the Single baseline and several (n, n_first) settings, reporting the average precision, the difference with the Single approach, and the mean number of selected collections.

Appendix 2: Search model equation

The Okapi probabilistic model [14] calculates the weight of the term t within a document d as follows:

w_td = ((k_1 + 1) · tf_td) / (K + tf_td)

where:
- K = k · ((1 - b) + b · (l_d / advl)),
- l_d is the document length,
- advl is the average document length (set to 750),
- b is a constant (set to 0.9),
- k is a constant (set to 2),
- k_1 is a constant (set to 1.2),
- tf_td is the occurrence frequency of the term t in document d.

The following formula gives the weight assigned to a search keyword t within the query q:

w_tq = (tf_tq / (k_3 + tf_tq)) · log((n - df_t) / df_t)

where:
- tf_tq is the search term frequency in the query,
- df_t is the number of documents in the collection containing the term t,
- n is the number of documents in the collection,
- k_3 is a constant (set to 1000).
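As a companion to these formulas, the short Python sketch below computes the Okapi document-term and query-term weights with the constants listed above. The statistics passed in (term frequencies, document length, document frequencies) are assumed inputs; in the experiments these quantities were produced by the SMART indexing, which is not reproduced here, and the final retrieval status value is assumed to be the usual sum of document-weight times query-weight products, which the appendix does not spell out.

```python
import math

# Constants as listed in Appendix 2.
ADVL, B, K_CONST, K1, K3 = 750, 0.9, 2, 1.2, 1000

def okapi_doc_weight(tf_td, doc_len):
    """Weight of term t in document d: w_td = ((k1 + 1) * tf) / (K + tf)."""
    K = K_CONST * ((1 - B) + B * doc_len / ADVL)
    return ((K1 + 1) * tf_td) / (K + tf_td)

def okapi_query_weight(tf_tq, df_t, n_docs):
    """Weight of term t in query q: (tf / (k3 + tf)) * log((n - df) / df)."""
    return (tf_tq / (K3 + tf_tq)) * math.log((n_docs - df_t) / df_t)

def okapi_score(query_tf, doc_tf, doc_len, df, n_docs):
    """Assumed retrieval status value of a document: sum over shared terms
    of the document weight times the query weight."""
    return sum(okapi_doc_weight(doc_tf[t], doc_len) *
               okapi_query_weight(query_tf[t], df[t], n_docs)
               for t in query_tf if t in doc_tf and t in df)

# Toy usage: a two-term query against one 900-term document.
print(okapi_score(query_tf={"mexican": 1, "food": 1},
                  doc_tf={"mexican": 3, "food": 5},
                  doc_len=900,
                  df={"mexican": 1200, "food": 45000},
                  n_docs=1_692_096))
```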


Patent Classification Using Ontology-Based Patent Network Analysis

Patent Classification Using Ontology-Based Patent Network Analysis Association for Information Systems AIS Electronic Library (AISeL) PACIS 2010 Proceedings Pacific Asia Conference on Information Systems (PACIS) 2010 Patent Classification Using Ontology-Based Patent Network

More information

Report on the CLEF-IP 2012 Experiments: Search of Topically Organized Patents

Report on the CLEF-IP 2012 Experiments: Search of Topically Organized Patents Report on the CLEF-IP 2012 Experiments: Search of Topically Organized Patents Michail Salampasis Vienna University of Technology Institute of Software Technology and Interactive Systems Vienna, Austria

More information

Inter and Intra-Document Contexts Applied in Polyrepresentation

Inter and Intra-Document Contexts Applied in Polyrepresentation Inter and Intra-Document Contexts Applied in Polyrepresentation Mette Skov, Birger Larsen and Peter Ingwersen Department of Information Studies, Royal School of Library and Information Science Birketinget

More information

Content-based search in peer-to-peer networks

Content-based search in peer-to-peer networks Content-based search in peer-to-peer networks Yun Zhou W. Bruce Croft Brian Neil Levine yzhou@cs.umass.edu croft@cs.umass.edu brian@cs.umass.edu Dept. of Computer Science, University of Massachusetts,

More information

External Query Reformulation for Text-based Image Retrieval

External Query Reformulation for Text-based Image Retrieval External Query Reformulation for Text-based Image Retrieval Jinming Min and Gareth J. F. Jones Centre for Next Generation Localisation School of Computing, Dublin City University Dublin 9, Ireland {jmin,gjones}@computing.dcu.ie

More information

A Unified User Profile Framework for Query Disambiguation and Personalization

A Unified User Profile Framework for Query Disambiguation and Personalization A Unified User Profile Framework for Query Disambiguation and Personalization Georgia Koutrika, Yannis Ioannidis University of Athens, Greece {koutrika, yannis}@di.uoa.gr Abstract. Personalization of keyword

More information

Focused Retrieval Using Topical Language and Structure

Focused Retrieval Using Topical Language and Structure Focused Retrieval Using Topical Language and Structure A.M. Kaptein Archives and Information Studies, University of Amsterdam Turfdraagsterpad 9, 1012 XT Amsterdam, The Netherlands a.m.kaptein@uva.nl Abstract

More information

Indri at TREC 2005: Terabyte Track (Notebook Version)

Indri at TREC 2005: Terabyte Track (Notebook Version) Indri at TREC 2005: Terabyte Track (Notebook Version) Donald Metzler, Trevor Strohman, Yun Zhou, W. B. Croft Center for Intelligent Information Retrieval University of Massachusetts, Amherst Abstract This

More information

A Cluster-Based Resampling Method for Pseudo- Relevance Feedback

A Cluster-Based Resampling Method for Pseudo- Relevance Feedback A Cluster-Based Resampling Method for Pseudo- Relevance Feedback Kyung Soon Lee W. Bruce Croft James Allan Department of Computer Engineering Chonbuk National University Republic of Korea Center for Intelligent

More information

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! (301) 219-4649 james.mayfield@jhuapl.edu What is Information Retrieval? Evaluation

More information

Document Structure Analysis in Associative Patent Retrieval

Document Structure Analysis in Associative Patent Retrieval Document Structure Analysis in Associative Patent Retrieval Atsushi Fujii and Tetsuya Ishikawa Graduate School of Library, Information and Media Studies University of Tsukuba 1-2 Kasuga, Tsukuba, 305-8550,

More information

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY IJCSNS International Journal of Computer Science and Network Security, VOL.9 No.4, April 2009 349 WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY Mohammed M. Sakre Mohammed M. Kouta Ali M. N. Allam Al Shorouk

More information