A generic ranking function discovery framework by genetic programming for information retrieval

Information Processing and Management xxx (2003) xxx-xxx

Weiguo Fan a,*, Michael D. Gordon b, Praveen Pathak c

a Department of Accounting and Information Systems, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061, USA
b Department of Computer and Information Systems, University of Michigan, Ann Arbor, MI 48105, USA
c Department of Decision and Information Sciences, University of Florida, Gainesville, FL 32611, USA

Received 31 January 2003; accepted 8 August 2003

* Corresponding author. E-mail addresses: wfan@vt.edu (W. Fan), mdgordon@umich.edu (M.D. Gordon), praveen@ufl.edu (P. Pathak).
An earlier version of this paper was presented at the 2000 International Conference on Information Systems (Fan, Gordon, & Pathak, 2000).
© 2003 Published by Elsevier Ltd.

Abstract

Ranking functions play a substantial role in the performance of information retrieval (IR) systems and search engines. Although there are many ranking functions available in the IR literature, various empirical evaluation studies show that ranking functions do not perform consistently well across different contexts (queries, collections, users). Moreover, it is often difficult and very expensive for human beings to design optimal ranking functions that work well in all these contexts. In this paper, we propose a novel ranking function discovery framework based on Genetic Programming and show through various experiments how this new framework helps automate the ranking function design/discovery process.

Keywords: Information retrieval; Ranking function; Genetic algorithms; Genetic programming; Text mining

1. Introduction

The information retrieval (IR) field is undergoing dramatic development and change due to advances in information technology and computation techniques. The large amount of digital information increasingly available in our society makes information retrieval research one of the most exciting and important fields. According to SearchEngineWatch.com, 75% of online users use search engines to traverse the web, which indicates the importance of information retrieval in our daily life.

Despite the recent advances in information retrieval and search technologies, studies (Gordon & Pathak, 1999) show that the performance of search engines is not quite up to the expectations of end users. Users often spend a lot of time sifting through hit lists full of irrelevant results. Various reasons contribute to the dissatisfaction of end users: imprecise query formulation, unfamiliarity with system usage, etc. We argue in this paper that the ranking strategies adopted by these search engines also deserve part of the blame.

Ranking strategies, often called ranking functions in IR, are used to order search results by decreasing relevance match with a user's search query. Most IR systems use a single fixed ranking strategy to support the information seeking task of all users for all queries, irrespective of the heterogeneity of end users and queries. This is the so-called consensus search, in which the computed relevancy for the entire population is presumed appropriate for each user (Pitkow et al., 2002). It is true that one of the major benefits of consensus search is that all users get the same results, which fosters result sharing among users (Pitkow et al., 2002). However, there are many other cases where users prefer search results to be tailored to their own personal preferences, the so-called personalized search or personalized ranking (Pitkow et al., 2002). Most current search engines do not support such an advanced personalized search feature. Both consensus search and personalized search require a good ranking function to obtain good performance.
Although there are various ranking functions available, most of them are manually designed by IR experts based on heuristics, experience, or observations. Although some of these ranking functions, such as that used in Okapi (see Eq. (2)), are designed based on probabilistic theory, their performance for each individual query is not guaranteed. In other words, even though those theoretically justified ranking functions may work reasonably well on average for a set of queries, they may not work well for each individual query. In fact, various ranking function evaluations and comparative studies (Salton, 1989; Zobel & Moffat, 1998) showed that these ranking functions do not work consistently well across queries. Moreover, it requires a lot of human effort to design a personalized ranking function for each individual query. Finding an optimal ranking function for a particular query or a group of queries remains a design challenge for IR research.

In this paper, we introduce a systematic and automatic discovery framework to aid the ranking function design process. This ranking function discovery is based on an artificial intelligence technique called Genetic Programming, which is widely used in various optimal design and data mining applications (Koza, 1992). We show through various experiments using real textual data that the new ranking function discovery framework is a flexible and powerful discovery tool for optimal ranking function design.

The remainder of this paper is organized as follows. In Section 2 we review related research on ranking function design and evaluation. In Section 3 we give a formal definition of the ranking function discovery problem and describe our ranking function discovery framework based on Genetic Programming. Section 4 discusses several experiments validating our ranking function discovery framework. We discuss related work in Section 5 and conclude the paper with implications of this study and future research directions in Section 6.

2. Prior research on ranking function design and evaluation

IR systems use ranking functions to order documents according to the documents' estimated match with a user query. To facilitate this relevance estimation process, both documents and user queries need to be transformed into a form that can be effectively processed by computers. One of the most successful models is the so-called Vector Space Model (VSM) (Salton, 1971, 1989). The VSM is the underlying model for this study for two reasons:

(1) Ease of interpretation. The VSM is a well-grounded model. It is based on vector spaces and thus can be easily interpreted from a geometric perspective (Jones & Furnas, 1987; Salton, 1989). For example, each document and query vector can be placed in a Euclidean n-dimensional space. The properties of such pairs of vectors, such as their similarity and closeness, can then be studied.

(2) Great success in performance evaluations. The VSM has been one of the most successful models in various performance evaluation studies (Harman, 1993, 1996; Salton, 1971; Salton & Buckley, 1988). Most existing search engines and information retrieval systems are based on this model.

In this model, both documents and user queries are represented as vectors of index terms. Suppose there are a total of t index terms in an entire collection. Then a given document D and query Q can be represented as follows:

$$D = (w_{d1}, w_{d2}, w_{d3}, \ldots, w_{dt})$$

$$Q = (w_{q1}, w_{q2}, w_{q3}, \ldots, w_{qt})$$

where $w_{di}$ and $w_{qi}$ (i = 1 to t) are term weights assigned to different terms for the document D and query Q, respectively.
The similarity between a query and a document can be calculated by the widely used cosine measure (Salton & Buckley, 1988):

$$\mathrm{Similarity}(Q, D) = \frac{\sum_{i=1}^{t} w_{qi}\, w_{di}}{\sqrt{\sum_{i=1}^{t} (w_{qi})^{2} \cdot \sum_{i=1}^{t} (w_{di})^{2}}} \qquad (1)$$

Documents then are ordered by decreasing values of this measure. The art of designing ranking functions depends on the design of term weighting strategies to assign weights for terms in a document, $w_{di}$ ($w_{qi}$ is commonly represented by 1 and thus can be safely ignored). Different term weighting strategies influence the similarity that is computed according to Formula (1). Three well-designed ranking functions are given below as examples, with variables explained in Table 1.
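Before turning to specific weighting schemes, the cosine measure of Formula (1) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `cosine_similarity`, the toy weight vectors, and the convention of returning 0 for an all-zero vector are assumptions made for the example.

```python
import math

def cosine_similarity(query_weights, doc_weights):
    """Cosine measure of Formula (1) for two weight vectors over the same t terms."""
    if len(query_weights) != len(doc_weights):
        raise ValueError("both vectors must cover the same t index terms")
    dot = sum(q * d for q, d in zip(query_weights, doc_weights))
    q_norm = math.sqrt(sum(q * q for q in query_weights))
    d_norm = math.sqrt(sum(d * d for d in doc_weights))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (q_norm * d_norm)

# Toy collection with t = 3 index terms; the weights are made up.
query = [1.0, 1.0, 0.0]
docs = {"A": [2.0, 1.0, 0.0], "B": [0.0, 0.0, 3.0], "C": [1.0, 0.0, 1.0]}

# Documents are ordered by decreasing similarity to the query.
ranking = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)
```

Changing the term weights in `docs` changes the ordering, which is why the choice of term weighting strategy matters so much for ranking.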

Table 1
Notation for ranking functions

tf            Frequency of a term (word) in the document text
QTW           Weighting strategy of a term (word) in the query text. tf is commonly used as the default weighting strategy
N             Total number of documents in the collection
df            Number of documents in the collection in which the term under consideration is present
length        Length of the document (in words)
length_avg    Average document length in the collection (in words)
tf_avg        Average term frequency of all the terms in a document
tf_max        Maximum term frequency of all the terms in a document
H             1.0 if tf_max does not exceed a fixed constant; that constant divided by tf_max otherwise (the value of the constant is missing from this transcription)

Okapi BM25:

$$\sum_{T \in Q} \frac{3 \cdot \mathrm{tf}}{0.5 + 1.5 \cdot \frac{\mathrm{length}}{\mathrm{length}_{avg}} + \mathrm{tf}} \cdot \log\frac{N - \mathrm{df} + 0.5}{\mathrm{df} + 0.5} \cdot \mathrm{QTW} \qquad (2)$$

Pivoted TFIDF:

$$\sum_{T \in Q} \frac{1 + \log(\mathrm{tf})}{1 + \log(\mathrm{tf}_{avg})} \cdot \frac{1}{0.8 + 0.2 \cdot \frac{\mathrm{length}}{\mathrm{length}_{avg}}} \cdot \log\frac{N + 1}{\mathrm{df}} \cdot \mathrm{QTW} \qquad (3)$$

INQUERY:

$$\sum_{T \in Q} \left( 0.4 + 0.6 \cdot \left( 0.4 \cdot H + 0.6 \cdot \frac{\log(\mathrm{tf} + 0.5)}{\log(\mathrm{tf}_{max} + 1.0)} \right) \cdot \frac{\log\frac{N}{\mathrm{df}}}{\log(N)} \right) \cdot \mathrm{QTW} \qquad (4)$$

The expression of these ranking functions is taken from Singhal, Salton, Mitra, and Buckley (1996). Although Okapi BM25 and INQUERY are not designed based on VSM, they can be represented or approximated using VSM notation (Singhal et al., 1996). It is not difficult to see from the above formulas that their differences lie in their term weighting strategies, that is, in the way they combine the various weighting features shown in Table 1. Changing the term weighting strategy essentially changes the behavior of a ranking function. Accordingly, from here on, the phrases term weighting strategy and ranking function will be used interchangeably.

Of course, there are many other ranking functions available (Salton & Buckley, 1988; Salton & McGill, 1983; Singhal et al., 1996; Zobel & Moffat, 1998). Various evaluation studies (Salton & Buckley, 1988; Singhal et al., 1996; Zobel & Moffat, 1998) on a wide range of ranking functions have shown that the performance of ranking functions is very context dependent.
These contexts involve queries, users, and text collections. In other words, a ranking function may be good for certain queries but bad for others. The differences in specification of queries and the vocabulary used to form them, the varied statistical distributions for terms in different collections, and the heterogeneity of end users, all contribute to the performance inconsistency of these ranking functions.
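To make the difference between term weighting strategies concrete, the following Python sketch computes per-term weights in the style of Formulas (2) and (3). It is only an illustration under stated assumptions: the forms coded here are the ones written in the comments, natural logarithms are assumed, and the function names and toy statistics are hypothetical.

```python
import math

def okapi_bm25_weight(tf, df, N, length, length_avg, qtw=1.0):
    """Per-term weight read from Formula (2):
    3*tf / (0.5 + 1.5*length/length_avg + tf) * log((N - df + 0.5)/(df + 0.5)) * QTW
    """
    tf_part = 3.0 * tf / (0.5 + 1.5 * length / length_avg + tf)
    idf_part = math.log((N - df + 0.5) / (df + 0.5))
    return tf_part * idf_part * qtw

def pivoted_tfidf_weight(tf, tf_avg, df, N, length, length_avg, qtw=1.0):
    """Per-term weight read from Formula (3):
    (1 + log tf)/(1 + log tf_avg) / (0.8 + 0.2*length/length_avg) * log((N + 1)/df) * QTW
    """
    tf_part = (1.0 + math.log(tf)) / (1.0 + math.log(tf_avg))
    length_norm = 0.8 + 0.2 * length / length_avg
    idf_part = math.log((N + 1.0) / df)
    return tf_part / length_norm * idf_part * qtw

def document_score(term_weights):
    """A document's score is the sum of the weights of the query terms it contains."""
    return sum(term_weights)

# Hypothetical statistics for one query term in a document of average length.
bm25 = okapi_bm25_weight(tf=1, df=10, N=100, length=120, length_avg=120)
ptfidf = pivoted_tfidf_weight(tf=2, tf_avg=2, df=10, N=99, length=120, length_avg=120)
```

The two functions consume the same lexical statistics from Table 1 but combine them differently, so they can order the same documents differently for the same query.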

These studies bring up the following issues.

a. There is no guarantee that existing ranking functions are the best/optimal ones available. It seems likely that more powerful functions are yet to be discovered. Given the empirical evidence of performance inconsistency of existing ranking functions in various experimental evaluations, there is no reason to believe that the ranking functions adopted by existing IR systems are the optimal ones. Previous proofs of optimality clearly do not fully explain behavior in real situations, since the assumptions used in these proofs often do not apply. Chances are that some of the better (high-performing) ranking functions are yet to be discovered (Pathak, Gordon, & Fan, 2000).

b. There is no general consensus on which ranking function should be used for which context. Due to the extreme complexity of relevance estimation, the performance of a ranking function is not guaranteed a priori. The same ranking function may work well for some queries but not for others (Salton, 1989; Zobel & Moffat, 1998). In part because there are a large number of ranking functions available, the best way to pick the right ranking function for a given user query is not generally known.

c. The IR field should advance if we understand when to use what ranking function. Recent work led by Harman in TREC aims to explore when and why particular strategies perform well, and where the variance in performance originates. Empirical studies are needed to gain more understanding, which may later lead to better theories of IR.

Clearly, a framework for good ranking function design or discovery is much needed.

3. A ranking function discovery framework based on GP

3.1. Nature of the ranking function discovery problem

The problem of finding a good ranking function is illustrated in Fig. 1, where 1 and 0 stand for relevant and non-relevant, respectively, in the Rele. column of both document tables.
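[Figure 1 content. Input: a user query plus documents with relevance judgments (Doc./Rele.: A 1, B 0, C 0, D 1, E 0, F 1, G 1). Ranking Function Discovery produces a Ranking Function. Output: the documents reordered (Doc./Rele.: A 1, D 1, F 1, G 1, B 0, C 0, E 0).]

Fig. 1. Ranking function discovery problem.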

The problem of finding a ranking function can be formalized as follows: Given as input a user query (or a set of queries) and a set of training documents with known relevance judgments for the query (queries), the discovery framework seeks a ranking function that can potentially rank all relevant documents at the top and non-relevant ones at the bottom.

Because of the characteristics of the problem discussed above, we chose an inductive learning technique called Genetic Programming (GP) (Koza, 1992) as the underlying learning technique to perform the discovery task. Genetic programming, an extension of Genetic Algorithms (GAs), is a problem-solving system designed following the principles of inheritance and evolution. In GP, each potential solution is called an individual in a population. GP works by iteratively applying genetic transformations, such as crossover and mutation, to a population of individuals to create more diverse and better performing individuals in subsequent generations. Refer to Koza (1992) for a detailed introduction to this field. Both GP and GA have been applied earlier to the information retrieval field (Chen, Chung, Ramsey, & Yang, 1998; Fan et al., 2000; Gordon, 1988, 1991; Martin-Bautista, Vila, & Larsen, 1999; Pathak et al., 2000; Raghavan & Agarwal, 1987).

The rationale behind using GP for ranking function discovery is as follows:

No stringent requirement for an objective function: GP can optimize any type of objective function. It does not require that the objective function be continuous, as long as the objective function can differentiate good solutions from bad ones (Koza, 1992). This property allows common discrete performance measures of IR, such as Average Precision, to be used as legitimate objective functions for GP learning.
Ease of representation of ranking functions as solutions, and thus adaptability by GP: One of the major modeling advantages of GP over other learning techniques, such as Neural Networks, is that it allows many forms of representation for the solutions (Langdon, 1998), in this case ranking functions. A common representation is a tree data structure. An example of representing one term weighting formula using a tree structure is shown in Fig. 2. The tree-based representation allows for ease of parsing and implementation. A term weighting formula is applied to each term within a document to calculate the document's similarity to the query.

[Figure 2 content. Tree form: the root node * has the children tf and log; log has the single child /; / has the children N and df. Equation form: tf * log(N/df).]

Fig. 2. A sample term weighting formula in tree and equation forms.
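The tree representation in Fig. 2 can be sketched with nested tuples in Python. This is a minimal illustration rather than the authors' code: the tuple encoding, the function names, and the "protected" division and logarithm (which return a fixed value instead of failing on invalid input) are assumptions made for the example.

```python
import math

# A tree is a terminal name (str), a numeric constant, or a tuple
# (function_name, child_1, ..., child_k). This one encodes tf * log(N / df).
FIG2_TREE = ("*", "tf", ("log", ("/", "N", "df")))

def evaluate(tree, stats):
    """Evaluate a term weighting tree against one term's statistics."""
    if isinstance(tree, (int, float)):
        return float(tree)
    if isinstance(tree, str):
        return float(stats[tree])
    op, *children = tree
    values = [evaluate(child, stats) for child in children]
    if op == "+":
        return values[0] + values[1]
    if op == "*":
        return values[0] * values[1]
    if op == "/":
        # Protected division (assumption): fall back to 1.0 on a zero divisor.
        return values[0] / values[1] if values[1] != 0 else 1.0
    if op == "log":
        # Protected log (assumption): fall back to 0.0 for non-positive input.
        return math.log(values[0]) if values[0] > 0 else 0.0
    raise ValueError(f"unknown function: {op}")

def to_equation(tree):
    """Render a tree in fully parenthesized equation form."""
    if not isinstance(tree, tuple):
        return str(tree)
    op, *children = tree
    parts = [to_equation(child) for child in children]
    if op == "log":
        return f"log({parts[0]})"
    return f"({parts[0]} {op} {parts[1]})"

def document_score(tree, matched_term_stats):
    """Apply the formula to each matching term and sum the weights."""
    return sum(evaluate(tree, stats) for stats in matched_term_stats)

stats = {"tf": 3, "N": 1000, "df": 10}   # hypothetical statistics for one term
weight = evaluate(FIG2_TREE, stats)       # 3 * ln(1000 / 10)
```

Because a tree is plain data, sub-trees can be cut out and swapped between individuals, which is what makes this representation convenient for genetic operators.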

Very useful and effective for nonlinear function and structure discovery: Because of its inherently parallel search mechanism, GP has been widely used in many engineering design, scheduling, and structural discovery problems, where many traditional optimization methods cannot be applied or do not work very well (Banzhaf, Nordin, Keller, & Francone, 1998; Koza, 1992).

Near globally optimal solution: Even though there is no guarantee that the solution identified by GP is globally optimal, it has been empirically found that the solutions discovered by GP are normally better than the ones designed by human beings or discovered using conventional heuristic algorithms. In many cases, the solutions found are very close to being globally optimal (Koza, 1992).

3.2. The framework for ranking function discovery

Our ranking function discovery framework, based on Genetic Programming, is summarized in Fig. 3. Note that Fig. 3 only shows the discovery framework for a set of queries. This set of queries could be submitted by a single user or by multiple users. This corresponds to the common ad hoc information retrieval task. The discovery framework for a single query (the so-called routing task) is a special case of the framework shown in Fig. 3 and is not given here. More details will be shown later for both single-query and multiple-query ranking function discovery.

As shown in Fig. 3, the entire discovery framework is an iterative process. Starting with a set of training documents with known relevance judgments, GP first operates on a large population of random ranking functions. These ranking functions are then evaluated based on the relevance information from the training documents. If the stopping criterion is not met, GP goes through the genetic transformation steps to create and evaluate the next generation population iteratively.
The validation data set is used to help select ranking functions that generalize well for unseen documents and thus to avoid the effect of over-training. The following is a detailed description of our implementation of the above framework.

Terminals. In GP, terminals are leaf nodes of a tree data structure (as shown in Fig. 2). Essentially they are weighting features used in term weighting to capture lexical statistics. See Formulas (2) and (3) for examples. In this work, terminals were chosen after examining various term weighting and ranking formulas, as in Salton (1989), Singhal et al. (1996), and Zobel and Moffat (1998). These terminals are listed in Table 2.

Functions. Functions are the operations applied to combine terminals and/or sub-trees to produce other trees. The following functions were used in our implementation: +, ×, /, log.

Initial population generation. A population is a set of individuals that represent document term weighting formulas. (For query term weighting, we use tf, which almost always produces a weight of 1 since most of these query terms appear only once.)
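As a rough illustration of how terminals and functions combine into random individuals, the Python sketch below grows random trees up to a depth limit. It is a simplified sketch, not the generation method used in the paper: the terminal names follow Table 2, the function set is assumed to be +, ×, / and log, and the probabilities `leaf_prob` and `const_prob` as well as the helper names are hypothetical.

```python
import random

# Terminal names as listed in Table 2; R (a random real constant) is drawn separately.
TERMINALS = ["tf", "tf_max", "tf_avg", "tf_doc_max", "df", "df_max",
             "N", "length", "length_avg", "n"]
# Function set assumed here, mapped to the number of arguments each takes.
FUNCTIONS = {"+": 2, "*": 2, "/": 2, "log": 1}

def random_tree(max_depth, rng, full=False, leaf_prob=0.3, const_prob=0.1):
    """Build a random tree of nested tuples with at most max_depth levels.

    With full=True every branch reaches max_depth; otherwise a branch may
    stop early with probability leaf_prob (an assumed setting).
    """
    at_limit = max_depth <= 1
    if at_limit or (not full and rng.random() < leaf_prob):
        if rng.random() < const_prob:
            return round(rng.uniform(0.0, 1.0), 3)  # terminal R
        return rng.choice(TERMINALS)
    op = rng.choice(sorted(FUNCTIONS))
    children = tuple(random_tree(max_depth - 1, rng, full, leaf_prob, const_prob)
                     for _ in range(FUNCTIONS[op]))
    return (op,) + children

def depth(tree):
    """Number of levels in a tree; a lone terminal has depth 1."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + max(depth(child) for child in tree[1:])

rng = random.Random(0)  # seeded only so the example is repeatable
population = [random_tree(4, rng) for _ in range(50)]
full_tree = random_tree(4, rng, full=True)
```

Mixing `full=True` and `full=False` trees in one population is one way to obtain individuals of varied shapes and sizes.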

Fig. 3. The ranking function discovery framework.

Table 2
The terminals used in our GP system

Terminal      Statistical meaning
tf            Number of times a term appeared in a document
tf_max        Maximum tf for a document
tf_avg        Average tf for a document
tf_doc_max    Maximum tf in the entire document collection
df            Number of unique documents in which a term appeared
df_max        Maximum df for all terms in a given query
N             Total number of documents in the entire text collection
length        Length of a document
length_avg    Average length of a document in the entire collection
R             Real constant number randomly generated by the GP system
n             Number of unique terms in a document

Table 3
Performance measures and their definitions

Measure     Definition
P_avg       The average of the precision scores calculated every time a new relevant document is found, normalized by the total number of relevant documents in a collection
R_P         The precision when T_Rel (total number of relevant documents in a collection) documents are retrieved
T_Recall    The recall of the top 1000 documents retrieved

The initial set of trees, constrained to have a maximum depth of four levels, was generated by the ramped half-and-half method (Koza, 1992). This method stipulates that half of the randomly generated trees must be generated by a random process which ensures that all branches reach the maximum initial depth. The remaining randomly generated trees only require that branch lengths do not exceed this depth. These constraints have been found to generate a good initial sample of trees (Koza, 1992).

Fitness functions. A fitness function measures how effective a ranking function represented by an individual tree is for ranking. In our implementation, Average Precision (P_avg, defined in Table 3) in the top DCV documents (see footnote 2) is used as the fitness function. For multiple queries, it is the average of the P_avg for all queries.

The GP operators. Following the suggestion from Koza (1992), we use only Reproduction and Crossover in our implementation.

Reproduction. Reproduction copies the top rate_r × P trees in the current generation directly to the next generation without any genetic transformation. The reproduction rate, rate_r, is generally 0.1 or less, and P is the population size.

Crossover. Crossover ensures variety by creating trees that differ from their parents. For crossover, a method called tournament selection (Koza, 1992) is used. Tournament selection works by first selecting, with replacement, k (we use 6) trees at random from the current generation.
The two trees with the highest fitness are paired and exchange sub-trees.

Stopping criterion. We stop the GP discovery process after 30 generations, for two reasons. First, the simulation is highly computationally intensive. Second, our pilot experiments with sample queries indicated that 30 generations was a sufficient period to generate high-performing trees.

Footnote 2: DCV, standing for Document Cutoff Value, means the total number of results returned by the search engine. It is an arbitrary number. We set it at 1000 throughout the study (Harman, 1996).

4. Experiments

To test the ranking function discovery framework, we used the Associated Press (AP) news collection from the TREC conference (Harman, 1996) as our textual data. This news collection contains more than 240,000 news articles from 1988 to 1990 and covers a variety of domains and topics. It has been used widely in the IR field to test new retrieval algorithms. More specifically, we use AP88 (79,919 documents) as the training data, AP89 (84,678 documents) as the validation data, and AP90 (78,321 documents) as the test data. The reason for using the three-part data set design in our experiments is to overcome the potential overfitting problem (Fan et al., 2000), in which the model trained from the training data does not generalize very well to new, unseen data. Because the validation data set is not used during the GP training process, its use after training helps select better trees that can generalize well for unseen documents. Thus the three-part data set design is expected to help alleviate or reduce the effect of overfitting for GP. This is also the strategy commonly used in machine learning experiments (Mitchell, 1997).

For test queries, we used the 50 topics provided in TREC 7 (Voorhees & Harman, 1998). All the documents in the AP collection have been judged for relevance for each of these topics. Two different versions of the topic description were indexed. In one version, only the title + description portion of the topics is indexed. These queries are called short queries. The number of title terms contained in the short queries varies from 4 to 18. In the other version, all the components of the topics (title, description, narrative, and concepts) are indexed. Correspondingly, these queries are called long queries. The number of terms for the long queries varies from 17 to 96. These two versions of queries are created for the following reasons: (a) most user-provided queries are very short (Jansen, Spink, & Saracevic, 2000); and (b) we seek to identify whether there is any query effect due to query length differences.
We use the performance measures defined in Table 3. Among the three measures, P_avg is the primary one used by TREC for cross-system comparisons.

We designed two experiments to test our ranking function discovery framework. The first experiment tested the framework at the individual query level, with the aim of personalized ranking. The second experiment tested the framework on a group of queries for the purpose of consensus ranking or generic ranking function optimization. For both experiments, we use the Okapi and PTFIDF ranking functions shown in Formulas (2) and (3) as the baselines for comparison. We believe the results from these two experiments should provide enough insight into the efficacy of the ranking function discovery framework in discovering new ranking functions for different contexts. Both experiments used the same settings for the GP system (shown in Table 4). We describe each of the two experiments next.

Table 4
GP system experimental settings

Population size       200
Crossover rate        0.9
Reproduction rate     0.1
Generations           30

4.1. Ranking function discovery for information routing

In this experiment, we apply the ranking function discovery framework to each individual query to find the optimal ranking function for it and then use the discovered ranking function for later incoming documents. This corresponds to the information routing task. Because each training query, along with its relevance judgments, is submitted by one assessor, we also call this task a personalized search task (Pitkow et al., 2002): we believe the preference of the assessor has already been embedded in the query relevance judgment process.

Among the 50 queries we constructed from the 50 topics, 15 have few (0 to 4) relevant documents in the training, validation, or test data. Such a small number of relevant documents is unlikely to help the performance calculation and the discovery process. Thus these 15 queries were excluded from our experiments, which left us with 35 queries for the first experiment.

Table 5 summarizes the aggregate performance comparison between the Personalized ranking functions (RFs) discovered by GP and the two baseline ranking functions (PTFIDF and Okapi) on three measures: P_avg, R_P, and T_Recall. For PTFIDF and Okapi, we report the absolute performance results only. For the Personalized RFs, we additionally report the performance gain over the two baseline systems, PTFIDF and Okapi, respectively.

As can be seen from Table 5, Okapi in general performed better than PTFIDF. Personalized ranking functions discovered by GP performed the best among the three approaches. Overall, the Personalized RFs for individual queries discovered by GP outperformed PTFIDF on P_avg by more than 16% for short queries and more than 32% for long queries. Similarly, Personalized RFs outperformed Okapi on P_avg by 10.71% for short queries and by more than 17% for long queries.

In terms of R_P (which measures the precision in the top T_Rel documents), ranking functions discovered by GP again showed advantages over the two baseline approaches. The gain by Personalized RFs over PTFIDF for short and long queries is 16.79% and 31.98%, respectively. The gain by Personalized RFs over Okapi for short and long queries is 7.03% and 19.36%, respectively.

In terms of total recall (T_Recall), there is practically no difference between Personalized RFs and Okapi. On the other hand, Personalized RFs performed better than PTFIDF by 4.29% for short queries and 8.43% for long queries. Note that T_Recall is probably not as meaningful as the other measures, since people seldom retrieve 1000 documents.
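The three measures of Table 3 and the relative gains quoted above can be computed as in the following Python sketch. It is a minimal illustration following the definitions in Table 3, not the evaluation code used in the experiments; the function names are hypothetical, and the example ranking is the input ordering of documents A to G from Fig. 1.

```python
def average_precision(ranked_relevance, total_relevant):
    """P_avg: mean of the precision values at each relevant document found,
    normalized by the total number of relevant documents."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant if total_relevant else 0.0

def r_precision(ranked_relevance, total_relevant):
    """R_P: precision after T_Rel (= total_relevant) documents are retrieved."""
    if total_relevant == 0:
        return 0.0
    return sum(ranked_relevance[:total_relevant]) / total_relevant

def total_recall(ranked_relevance, total_relevant, cutoff=1000):
    """T_Recall: recall within the top `cutoff` retrieved documents."""
    if total_relevant == 0:
        return 0.0
    return sum(ranked_relevance[:cutoff]) / total_relevant

def relative_gain(new_value, baseline_value):
    """Percentage gain of a new score over a baseline score."""
    return 100.0 * (new_value - baseline_value) / baseline_value

# Relevance flags in rank order: the unranked input of Fig. 1 (A..G) and the
# ideal output in which all four relevant documents come first.
before = [1, 0, 0, 1, 0, 1, 1]
after = [1, 1, 1, 1, 0, 0, 0]
ap_before = average_precision(before, total_relevant=4)
ap_after = average_precision(after, total_relevant=4)
gain = relative_gain(ap_after, ap_before)
```

Because P_avg rewards relevant documents that appear early, it serves both as the reported measure and as the fitness signal that separates good ranking trees from bad ones.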
Table 5
Performance results of individual/personalized ranking function (RF) discovery
(cells marked [missing] did not survive in this transcription)

Query   Approach                 P_avg       R_P         T_Recall
Short   PTFIDF                   [missing]   [missing]   [missing]
        Okapi                    [missing]   [missing]   [missing]
        Personalized RFs by GP   [missing]   [missing]   [missing]
          vs. PTFIDF             [missing]   +16.79%     +4.29%
          vs. Okapi              +10.71%     +7.03%      -0.13%
Long    PTFIDF                   [missing]   [missing]   [missing]
        Okapi                    [missing]   [missing]   [missing]
        Personalized RFs by GP   [missing]   [missing]   [missing]
          vs. PTFIDF             [missing]   +31.98%     +8.43%
          vs. Okapi              [missing]   +19.36%     -1.31%

A subsequent repeated measure MANOVA test on the three performance measures P_avg, R_P and T_Recall showed that RFs do have a significant impact on search performance (F = 4.969, p = 0.00). We followed up with a multiple comparison with Bonferroni adjustment on all three measures to see whether the results obtained by the Personalized RFs are significantly better than the baseline systems. The results were statistically significant, with Personalized RFs better than the two baselines on both P_avg and R_P (both p's < 0.01). On T_Recall, there is a significant difference between Personalized RFs and PTFIDF, but not between Personalized RFs and Okapi.

Another interesting observation from Table 5 is that all three approaches performed much better with long queries than with short queries. This indicates the importance of better query representation. Longer queries tend to capture users' information needs better than short ones. With the query representations changed from short to long, the ranking function discovery framework by GP gained the most among the three approaches compared.

4.2. Ranking function discovery for ad hoc information retrieval

In this experiment, we apply the ranking function discovery framework to all 50 queries to find the optimal consensus ranking function for all the queries. This corresponds to the ad hoc information retrieval task. Note that we include the 15 queries excluded in experiment 1: a small number of relevant documents for a particular query will not have much impact on the evaluation of a consensus ranking function, because the fitness function for each ranking tree is now the average of the P_avg for all queries. The performance results reported in Table 6 for PTFIDF and Okapi are different from those reported in Table 5 due to the addition of 15 more queries.

Table 6 summarizes the aggregate performance comparison between the consensus RF discovered by GP, PTFIDF, and Okapi on three measures: P_avg, R_P, and T_Recall. For PTFIDF and Okapi, we report the absolute performance results only. For the consensus RF, we additionally report the performance gain over the two baseline systems, PTFIDF and Okapi, respectively.
As can be seen from Table 6, the consensus ranking function discovered by GP for 50 short queries outperformed both Okapi and PTFIDF in all three performance measures. The discovered consensus ranking function gained the most in R_P: more than 52% over PTFIDF and more than 37% over Okapi. This result indicates that the new consensus ranking function ranks more relevant documents at the top than both Okapi and PTFIDF, which is a property well sought in any search engine performance optimization. The consensus ranking function discovered by GP for 50 long queries outperformed PTFIDF in all three performance measures, and outperformed Table 6 Performance results of consensus ranking function discovery for multiple queries Query Approach P_avg R_P T_Recall Short PTFIDF Okapi Consensus RF by GP vs. PTFIDF % % +6.82% vs. Okapi +3.13% % +2.94% Long PTFIDF Okapi Consensus RF by GP vs. PTFIDF % % +5.60% vs. Okapi +5.71% % )0.28%

13 W. Fan et al. / Information Processing and Management xxx (2003) xxx xxx 13 Okapi in both P_avg and R_P. Its performance in T_Recall is very close to that of Okapi. A subsequent repeated measure MANOVA test on the three performance measures P_avg, R_P and T_Recall showed that the RFs again are statistically siginificant in ranking performance. A follow-up multiple comparison with Bonferroni adjustment showed that there is a significant difference between the consensus ranking function discovery by GP with PTFIDF on all three measures (all p s < 0:01). Significant difference were detected between the consensus RF discovered by GP with Okapi on R_P, but not on P_avg and T_Recall. This latter point indicates the newly discovered ranking function is very effective in terms of returning relevant documents at the top of a hit list on average, which is a well-sought feature in ad hoc information retrieval. Overall, GP has a better performance gain over Okapi in P_avg for long queries (5.71%) than short queries (3.13%). But these performance gains are much smaller than those in the routing task (more than 10%), which indicates the fact that it is a more challenging and difficult task to discover a better ranking function for multiple queries than for individual queries. To give readers an idea of the ranking functions we discovered using GP, we list the best consensus ranking function discovered by GP for 50 short queries in Fig. 4. The ranking tree shown in Fig. 4 can be expressed in the following mathematical form: tf tf N log tf tf avg þ þ logðtf 2 tf avgþ df n þ 2 tf doc max þ 0:373 tf avgðtf doc maxþnþ df It is interesting to note that GP can discover some of the well known ranking strategies, like TFIDF tf N =df. Moreover, the denominator in Formula (5) shows that other alternative normalization factors may exist beside the traditional document length normalization as used in Okapi and Ptfidf. 
To further verify the generalizability of the newly discovered ranking function, we tested it along with the two baseline ranking functions, Okapi and PTFIDF, on the 10 GB web corpus used in TREC 10 for 50 short queries. The results are summarized in Table 7. The results show that the newly discovered ranking function given in Formula (5) performed very well in the web search context. It maintained the performance gain over both Okapi and PTFIDF that it showed on the AP news corpus.

Fig. 4. The best consensus ranking function discovered by GP for 50 short queries.
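The "vs. PTFIDF" and "vs. Okapi" rows in the results tables read as relative changes with respect to a baseline. Assuming the usual definition (difference divided by the baseline score, which the text does not state explicitly), the computation can be sketched as follows. The scores used here are made up and are not taken from any table in this paper.

```python
def relative_gain(new, baseline):
    """Percentage change of `new` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (new - baseline) / baseline


# Hypothetical P_avg scores for a baseline and a discovered ranking function.
baseline_p_avg = 0.200
gp_p_avg = 0.210
print(f"{relative_gain(gp_p_avg, baseline_p_avg):+.2f}%")  # +5.00%
```

A negative value, such as the T_Recall entry for long queries against Okapi, means the discovered function scored slightly below the baseline on that measure.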

Table 7
Performance results of three RFs on 10 GB web corpus

Query   Approach               P_avg     R_P       T_Recall
Short   PTFIDF                 ...       ...       ...
        Okapi                  ...       ...       ...
        Consensus RF by GP     ...       ...       ...
          vs. PTFIDF           ...%      ...%      ...%
          vs. Okapi            +3.34%    +6.83%    +3.13%

5. Related work

There have been several efforts on ranking function optimization in the IR literature. The earliest work is by Fox (1983), who used polynomial regression to optimize the ranking function. Fuhr and colleagues (Fuhr & Buckley, 1991; Fuhr & Pfeifer, 1994) used probabilistic models as machine learning approaches. The concept of relevance description used in Fuhr and Buckley (1991) and Fuhr and Pfeifer (1994) is very similar to the weighting evidence (tf, df, ...) we used for ranking. Our work differs from theirs in that we use a ranking function of arbitrary numerical functional form discovered by GP, whereas their ranking function (which they call a retrieval function) is either a polynomial regression function (Fuhr & Buckley, 1991) or a logistic regression/loglinear function (Fuhr & Pfeifer, 1994). Similar ideas using logistic regression for ranking function design and optimization have also been explored in Gey (1994).

Another line of research on ranking function optimization follows the combination of experts approach, in which a set of ranking functions is brought together either numerically through linear combination (Bartell, Cottrell, & Belew, 1994; Fox, Koushik, Shaw, Modlin, & Rao, 1993; Pathak et al., 2000; Vogt & Cottrell, 1999) or through a simple majority vote (Fox et al., 1993; Lee, 1997). Its effectiveness is limited by the number of experts (ranking functions) used and by how effective they are individually. Our work, by contrast, can produce new ranking functions with better performance than existing ones.
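As an illustration of the linear-combination flavor of the combination of experts approach, the sketch below merges the scores of several ranking functions with fixed weights. It is a generic example rather than the method of any cited paper: the function name `combine_linear`, the dictionary layout, the weights and the scores are all hypothetical, and the scores are assumed to be already normalized to comparable ranges.

```python
def combine_linear(score_lists, weights):
    """Rank documents by a weighted sum of scores from several ranking functions.

    score_lists: one dict per ranking function, mapping doc id -> score.
    weights: one weight per ranking function.
    A document missing from one list contributes 0 for that function.
    """
    if len(score_lists) != len(weights):
        raise ValueError("need exactly one weight per ranking function")
    combined = {}
    for scores, weight in zip(score_lists, weights):
        for doc_id, score in scores.items():
            combined[doc_id] = combined.get(doc_id, 0.0) + weight * score
    # Highest combined score first; ties broken by doc id for determinism.
    return sorted(combined, key=lambda doc_id: (-combined[doc_id], doc_id))


# Made-up normalized scores from two ranking functions.
expert_a = {"d1": 0.9, "d2": 0.4}
expert_b = {"d2": 0.8, "d3": 0.5}
print(combine_linear([expert_a, expert_b], [0.5, 0.5]))  # ['d2', 'd1', 'd3']
```

The combined ranking can only reorder what the individual experts already score, which is why the quality and number of experts bound what this approach can achieve.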
These newly discovered ranking functions can be combined with other well-known ranking functions, using the combination of experts approach, to further improve ranking performance.

6. Conclusions

In this paper, by effectively leveraging the different weighting features used by many IR experts, we demonstrated that a machine intelligence tool like GP can help us automatically discover better ranking functions for a variety of contexts, a task that would otherwise be very tedious and difficult for any human being. More specifically, the new ranking function discovery framework based on GP can be used to discover either personalized ranking functions for each individual query or a consensus ranking function for a group of queries. The comparison results with two well-known baselines demonstrated the advantage of this approach. The proposed framework is well suited for both information routing and ad hoc information retrieval.

We believe the ranking function discovery framework can be used as an effective personalization tool for user preference modeling. A user's preference model, once discovered, can be used for future information discovery and filtering. This new discovery framework can also be used by search engine vendors to fine-tune their search engines' performance. Many queries submitted to search engines are repeated over time. Optimizing the ranking functions used for these queries will help improve the success rate and user satisfaction on these queries.

In the future, we plan to apply the discovery framework directly in the web search context. Currently, we have only run the discovery process on the news corpus from the Associated Press. Web documents are more heterogeneous than the news corpus. They also contain various tags, such as <Anchor>, <Title>, and <Meta>, that may potentially help improve a search engine's ranking performance. We plan to extend our framework to include structural and semantic information and to test it on the web. We would also like to apply GP to other text mining tasks such as text classification, text summarization, and question answering. All these tasks require effective ranking of texts of different granularities: phrases, sentences, or documents. In all these cases, GP can be used as a machine learning tool to help discover the optimal way to rank these text units.

References

Banzhaf, W., Nordin, P., Keller, R. E., & Francone, F. D. (1998). Genetic programming: an introduction on the automatic evolution of computer programs and its applications. San Francisco, CA: Morgan Kaufmann Publishers.

Bartell, B. T., Cottrell, G. W., & Belew, R. K. (1994). Automatic combination of multiple ranked retrieval systems. In The proceedings of seventeenth annual international ACM SIGIR conference on research and development in information retrieval (pp ). Available: citeseer.nj.nec.com/bartell94automatic.html.

Chen, H., Chung, Y., Ramsey, M., & Yang, C. (1998). A smart itsy bitsy spider for the web. Journal of the American Society for Information Science, 49(7),

Fan, W., Gordon, M. D., & Pathak, P. (2000). Personalization of search engine services for effective retrieval and knowledge management. In Proceedings of 2000 international conference on information systems (ICIS), Brisbane, Australia (pp ).

Fox, E. A. (1983). Extending the boolean and vector space models of information retrieval with p-norm queries and multiple concept types. Ph.D. thesis, Cornell University.

Fox, E. A., Koushik, M. P., Shaw, J., Modlin, R., & Rao, D. (1993). Combining evidence from multiple searches. In Proceedings of the first text retrieval conference (TREC-1). NIST Special Publication (pp ).

Fuhr, N., & Buckley, C. (1991). A probabilistic learning approach for document indexing. ACM Transactions on Information Systems, 9(3), Available: citeseer.nj.nec.com/fuhr91probabilistic.html.

Fuhr, N., & Pfeifer, U. (1994). Probabilistic information retrieval as combination of abstraction, inductive learning and probabilistic assumptions. ACM Transactions on Information Systems, 12(1), Available: citeseer.nj.nec.com/fuhr94probabilistic.html.

Gey, F. C. (1994). Inferring probability of relevance using the method of logistic regression. In The proceedings of seventeenth annual international ACM SIGIR conference on research and development in information retrieval (pp ).

Gordon, M. (1988). Probabilistic and genetic algorithms for document retrieval. Communications of the ACM, 31(2),

Gordon, M. (1991). User-based document clustering by redescribing subject descriptions with a genetic algorithm. Journal of the American Society for Information Science, 42(5),

Gordon, M., & Pathak, P. (1999). Finding information on the World Wide Web: the retrieval effectiveness of search engines. Information Processing and Management, 35(2),

Harman, D. K. (1993). Overview of the first text retrieval conference (TREC-1). In D. K. Harman (Ed.), Proceedings of the first text retrieval conference. NIST Special Publication (pp. 1-20).

Harman, D. K. (1996). Overview of the fourth text retrieval conference (TREC-4). In D. K. Harman (Ed.), Proceedings of the fourth text retrieval conference. NIST Special Publication (pp. 1-24).

Jansen, B. J., Spink, A., & Saracevic, T. (2000). Real life, real users, and real needs: a study and analysis of user queries on the web. Information Processing and Management, 36(2),

Jones, W. P., & Furnas, G. W. (1987). Pictures of relevance: a geometric analysis of similarity measures. Journal of the American Society for Information Science, 38(6),

Koza, J. R. (1992). Genetic programming: on the programming of computers by means of natural selection. Cambridge, MA, USA: MIT Press.

Langdon, W. B. (1998). Data structures and genetic programming: genetic programming + data structures = automatic programming. Kluwer Publishing.

Lee, J. H. (1997). Analyses of multiple evidence combination. In The proceedings of twentieth annual international ACM SIGIR conference on research and development in information retrieval (pp ).

Martin-Bautista, M. J., Vila, M., & Larsen, H. L. (1999). A fuzzy genetic algorithm approach to an adaptive information retrieval agent. Journal of the American Society for Information Science, 50(9),

Mitchell, T. M. (1997). Machine learning. New York, NY: McGraw Hill.

Pathak, P., Gordon, M., & Fan, W. (2000). Effective information retrieval using genetic algorithms based matching function adaptation. In Proceedings of the 33rd Hawaii international conference on system science (HICSS), Hawaii, USA.

Pitkow, J., Schutze, H., Cass, T., Cooley, R., Turnbull, D., Edmonds, A., Adar, E., & Breuel, T. (2002). Personalized search. Communications of the ACM, 45(9),

Raghavan, V. V., & Agarwal, B. (1987). Optimal determination of user-oriented clusters: an application for the reproductive plan. In Proceedings of the second international conference on genetic algorithms and their applications, Cambridge, MA (pp ).

Salton, G. (1971). The SMART retrieval system: experiments in automatic document processing. New Jersey: Prentice Hall.

Salton, G. (1989). Automatic text processing. Reading, MA: Addison-Wesley Publishing Co.

Salton, G., & Buckley, C. (1988). Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5),

Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. McGraw-Hill.

Singhal, A., Salton, G., Mitra, M., & Buckley, C. (1996). Document length normalization. Information Processing and Management, 32(5),

Vogt, C. C., & Cottrell, G. W. (1999). Fusion via a linear combination of scores. Information Retrieval, 1(3),

Voorhees, E. M., & Harman, D. K. (1998). Overview of the seventh text retrieval conference (TREC-7). In E. M. Voorhees & D. K. Harman (Eds.), Proceedings of the seventh text retrieval conference. NIST Special Publication (pp. 1-24).

Zobel, J., & Moffat, A. (1998). Exploring the similarity space. SIGIR Forum, 32(1),


More information

Network Routing Protocol using Genetic Algorithms

Network Routing Protocol using Genetic Algorithms International Journal of Electrical & Computer Sciences IJECS-IJENS Vol:0 No:02 40 Network Routing Protocol using Genetic Algorithms Gihan Nagib and Wahied G. Ali Abstract This paper aims to develop a

More information

Application of a Genetic Algorithm to a Scheduling Assignement Problem

Application of a Genetic Algorithm to a Scheduling Assignement Problem Application of a Genetic Algorithm to a Scheduling Assignement Problem Amândio Marques a and Francisco Morgado b a CISUC - Center of Informatics and Systems of University of Coimbra, 3030 Coimbra, Portugal

More information

Relevance Score Normalization for Metasearch

Relevance Score Normalization for Metasearch Relevance Score Normalization for Metasearch Mark Montague Department of Computer Science Dartmouth College 6211 Sudikoff Laboratory Hanover, NH 03755 montague@cs.dartmouth.edu Javed A. Aslam Department

More information

Risk Minimization and Language Modeling in Text Retrieval Thesis Summary

Risk Minimization and Language Modeling in Text Retrieval Thesis Summary Risk Minimization and Language Modeling in Text Retrieval Thesis Summary ChengXiang Zhai Language Technologies Institute School of Computer Science Carnegie Mellon University July 21, 2002 Abstract This

More information

A Combined Meta-Heuristic with Hyper-Heuristic Approach to Single Machine Production Scheduling Problem

A Combined Meta-Heuristic with Hyper-Heuristic Approach to Single Machine Production Scheduling Problem A Combined Meta-Heuristic with Hyper-Heuristic Approach to Single Machine Production Scheduling Problem C. E. Nugraheni, L. Abednego Abstract This paper is concerned with minimization of mean tardiness

More information

Handling Missing Values via Decomposition of the Conditioned Set

Handling Missing Values via Decomposition of the Conditioned Set Handling Missing Values via Decomposition of the Conditioned Set Mei-Ling Shyu, Indika Priyantha Kuruppu-Appuhamilage Department of Electrical and Computer Engineering, University of Miami Coral Gables,

More information

TREC-3 Ad Hoc Retrieval and Routing. Experiments using the WIN System. Paul Thompson. Howard Turtle. Bokyung Yang. James Flood

TREC-3 Ad Hoc Retrieval and Routing. Experiments using the WIN System. Paul Thompson. Howard Turtle. Bokyung Yang. James Flood TREC-3 Ad Hoc Retrieval and Routing Experiments using the WIN System Paul Thompson Howard Turtle Bokyung Yang James Flood West Publishing Company Eagan, MN 55123 1 Introduction The WIN retrieval engine

More information

ITERATIVE SEARCHING IN AN ONLINE DATABASE. Susan T. Dumais and Deborah G. Schmitt Cognitive Science Research Group Bellcore Morristown, NJ

ITERATIVE SEARCHING IN AN ONLINE DATABASE. Susan T. Dumais and Deborah G. Schmitt Cognitive Science Research Group Bellcore Morristown, NJ - 1 - ITERATIVE SEARCHING IN AN ONLINE DATABASE Susan T. Dumais and Deborah G. Schmitt Cognitive Science Research Group Bellcore Morristown, NJ 07962-1910 ABSTRACT An experiment examined how people use

More information

Query Expansion for Noisy Legal Documents

Query Expansion for Noisy Legal Documents Query Expansion for Noisy Legal Documents Lidan Wang 1,3 and Douglas W. Oard 2,3 1 Computer Science Department, 2 College of Information Studies and 3 Institute for Advanced Computer Studies, University

More information

Formal Model. Figure 1: The target concept T is a subset of the concept S = [0, 1]. The search agent needs to search S for a point in T.

Formal Model. Figure 1: The target concept T is a subset of the concept S = [0, 1]. The search agent needs to search S for a point in T. Although this paper analyzes shaping with respect to its benefits on search problems, the reader should recognize that shaping is often intimately related to reinforcement learning. The objective in reinforcement

More information

System of Systems Architecture Generation and Evaluation using Evolutionary Algorithms

System of Systems Architecture Generation and Evaluation using Evolutionary Algorithms SysCon 2008 IEEE International Systems Conference Montreal, Canada, April 7 10, 2008 System of Systems Architecture Generation and Evaluation using Evolutionary Algorithms Joseph J. Simpson 1, Dr. Cihan

More information

Verbose Query Reduction by Learning to Rank for Social Book Search Track

Verbose Query Reduction by Learning to Rank for Social Book Search Track Verbose Query Reduction by Learning to Rank for Social Book Search Track Messaoud CHAA 1,2, Omar NOUALI 1, Patrice BELLOT 3 1 Research Center on Scientific and Technical Information 05 rue des 03 frères

More information

Knowledge Engineering in Search Engines

Knowledge Engineering in Search Engines San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Spring 2012 Knowledge Engineering in Search Engines Yun-Chieh Lin Follow this and additional works at:

More information

A User Study on Features Supporting Subjective Relevance for Information Retrieval Interfaces

A User Study on Features Supporting Subjective Relevance for Information Retrieval Interfaces A user study on features supporting subjective relevance for information retrieval interfaces Lee, S.S., Theng, Y.L, Goh, H.L.D., & Foo, S. (2006). Proc. 9th International Conference of Asian Digital Libraries

More information

Information Retrieval. hussein suleman uct cs

Information Retrieval. hussein suleman uct cs Information Management Information Retrieval hussein suleman uct cs 303 2004 Introduction Information retrieval is the process of locating the most relevant information to satisfy a specific information

More information

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl

More information

A NEW PERFORMANCE EVALUATION TECHNIQUE FOR WEB INFORMATION RETRIEVAL SYSTEMS

A NEW PERFORMANCE EVALUATION TECHNIQUE FOR WEB INFORMATION RETRIEVAL SYSTEMS A NEW PERFORMANCE EVALUATION TECHNIQUE FOR WEB INFORMATION RETRIEVAL SYSTEMS Fidel Cacheda, Francisco Puentes, Victor Carneiro Department of Information and Communications Technologies, University of A

More information

Informativeness for Adhoc IR Evaluation:

Informativeness for Adhoc IR Evaluation: Informativeness for Adhoc IR Evaluation: A measure that prevents assessing individual documents Romain Deveaud 1, Véronique Moriceau 2, Josiane Mothe 3, and Eric SanJuan 1 1 LIA, Univ. Avignon, France,

More information

Problem 1: Complexity of Update Rules for Logistic Regression

Problem 1: Complexity of Update Rules for Logistic Regression Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 16 th, 2014 1

More information

Solving the Travelling Salesman Problem in Parallel by Genetic Algorithm on Multicomputer Cluster

Solving the Travelling Salesman Problem in Parallel by Genetic Algorithm on Multicomputer Cluster Solving the Travelling Salesman Problem in Parallel by Genetic Algorithm on Multicomputer Cluster Plamenka Borovska Abstract: The paper investigates the efficiency of the parallel computation of the travelling

More information

Link Prediction for Social Network

Link Prediction for Social Network Link Prediction for Social Network Ning Lin Computer Science and Engineering University of California, San Diego Email: nil016@eng.ucsd.edu Abstract Friendship recommendation has become an important issue

More information

A Fusion Approach to XML Structured Document Retrieval

A Fusion Approach to XML Structured Document Retrieval A Fusion Approach to XML Structured Document Retrieval Ray R. Larson School of Information Management and Systems University of California, Berkeley Berkeley, CA 94720-4600 ray@sims.berkeley.edu 17 April

More information

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY IJCSNS International Journal of Computer Science and Network Security, VOL.9 No.4, April 2009 349 WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY Mohammed M. Sakre Mohammed M. Kouta Ali M. N. Allam Al Shorouk

More information

The Genetic Algorithm for finding the maxima of single-variable functions

The Genetic Algorithm for finding the maxima of single-variable functions Research Inventy: International Journal Of Engineering And Science Vol.4, Issue 3(March 2014), PP 46-54 Issn (e): 2278-4721, Issn (p):2319-6483, www.researchinventy.com The Genetic Algorithm for finding

More information

Research on Applications of Data Mining in Electronic Commerce. Xiuping YANG 1, a

Research on Applications of Data Mining in Electronic Commerce. Xiuping YANG 1, a International Conference on Education Technology, Management and Humanities Science (ETMHS 2015) Research on Applications of Data Mining in Electronic Commerce Xiuping YANG 1, a 1 Computer Science Department,

More information

NTUBROWS System for NTCIR-7. Information Retrieval for Question Answering

NTUBROWS System for NTCIR-7. Information Retrieval for Question Answering NTUBROWS System for NTCIR-7 Information Retrieval for Question Answering I-Chien Liu, Lun-Wei Ku, *Kuang-hua Chen, and Hsin-Hsi Chen Department of Computer Science and Information Engineering, *Department

More information

Patent Classification Using Ontology-Based Patent Network Analysis

Patent Classification Using Ontology-Based Patent Network Analysis Association for Information Systems AIS Electronic Library (AISeL) PACIS 2010 Proceedings Pacific Asia Conference on Information Systems (PACIS) 2010 Patent Classification Using Ontology-Based Patent Network

More information

Query Modifications Patterns During Web Searching

Query Modifications Patterns During Web Searching Bernard J. Jansen The Pennsylvania State University jjansen@ist.psu.edu Query Modifications Patterns During Web Searching Amanda Spink Queensland University of Technology ah.spink@qut.edu.au Bhuva Narayan

More information

Robust Relevance-Based Language Models

Robust Relevance-Based Language Models Robust Relevance-Based Language Models Xiaoyan Li Department of Computer Science, Mount Holyoke College 50 College Street, South Hadley, MA 01075, USA Email: xli@mtholyoke.edu ABSTRACT We propose a new

More information

Evolving Variable-Ordering Heuristics for Constrained Optimisation

Evolving Variable-Ordering Heuristics for Constrained Optimisation Griffith Research Online https://research-repository.griffith.edu.au Evolving Variable-Ordering Heuristics for Constrained Optimisation Author Bain, Stuart, Thornton, John, Sattar, Abdul Published 2005

More information

MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS

MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS J.I. Serrano M.D. Del Castillo Instituto de Automática Industrial CSIC. Ctra. Campo Real km.0 200. La Poveda. Arganda del Rey. 28500

More information

Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching

Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching Wolfgang Tannebaum, Parvaz Madabi and Andreas Rauber Institute of Software Technology and Interactive Systems, Vienna

More information

A Case Study on the Similarity Between Source Code and Bug Reports Vocabularies

A Case Study on the Similarity Between Source Code and Bug Reports Vocabularies A Case Study on the Similarity Between Source Code and Bug Reports Vocabularies Diego Cavalcanti 1, Dalton Guerrero 1, Jorge Figueiredo 1 1 Software Practices Laboratory (SPLab) Federal University of Campina

More information

Reducing Redundancy with Anchor Text and Spam Priors

Reducing Redundancy with Anchor Text and Spam Priors Reducing Redundancy with Anchor Text and Spam Priors Marijn Koolen 1 Jaap Kamps 1,2 1 Archives and Information Studies, Faculty of Humanities, University of Amsterdam 2 ISLA, Informatics Institute, University

More information