A generic ranking function discovery framework by genetic programming for information retrieval


Information Processing and Management xxx (2003) xxx xxx

Weiguo Fan a,*, Michael D. Gordon b, Praveen Pathak c

a Department of Accounting and Information Systems, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061, USA
b Department of Computer and Information Systems, University of Michigan, Ann Arbor, MI 48105, USA
c Department of Decision and Information Sciences, University of Florida, Gainesville, FL 32611, USA

Received 31 January 2003; accepted 8 August 2003

An earlier version of this paper was presented at the 2000 International Conference on Information Systems by Fan, Gordon, and Pathak (2000).
* Corresponding author. E-mail addresses: wfan@vt.edu (W. Fan), mdgordon@umich.edu (M.D. Gordon), praveen@ufl.edu (P. Pathak)

Abstract

Ranking functions play a substantial role in the performance of information retrieval (IR) systems and search engines. Although there are many ranking functions available in the IR literature, various empirical evaluation studies show that ranking functions do not perform consistently well across different contexts (queries, collections, users). Moreover, it is often difficult and very expensive for human beings to design optimal ranking functions that work well in all these contexts. In this paper, we propose a novel ranking function discovery framework based on Genetic Programming and show through various experiments how this new framework helps automate the ranking function design/discovery process. © 2003 Published by Elsevier Ltd.

Keywords: Information retrieval; Ranking function; Genetic algorithms; Genetic programming; Text mining

1. Introduction

The information retrieval (IR) field is undergoing dramatic development and change due to advances in information technology and computation techniques. The large amount of digital information increasingly available in our society makes information retrieval research one of the most exciting and important fields. According to SearchEngineWatch.com, 75% of online users use search engines to traverse the web, which indicates the importance of information retrieval in our daily life.

Despite the recent advances in information retrieval and search technologies, studies (Gordon & Pathak, 1999) show that the performance of search engines is not quite up to the expectations of end users. Users often spend quite a lot of time sifting through hit lists full of irrelevant results. Various reasons contribute to the dissatisfaction of end users: imprecise query formulation, unfamiliarity with system usage, etc. We argue in this paper that the ranking strategies adopted by these search engines also deserve part of the blame.

Ranking strategies, often called ranking functions in IR, are used to order search results by decreasing relevance match with a user's search query. Most IR systems use a single fixed ranking strategy to support the information seeking task of all users for all queries, irrespective of the heterogeneity of end users and queries. This is the so-called consensus search, in which the computed relevancy for the entire population is presumed appropriate for each user (Pitkow et al., 2002). It is true that one of the major benefits of consensus search is that all users get the same results, which fosters result sharing among users (Pitkow et al., 2002). However, there are many other cases where users prefer search results to be tailored to their own personal preference, the so-called personalized search or personalized ranking (Pitkow et al., 2002). Most current search engines do not support such an advanced personalized search feature. Both consensus search and personalized search require a good ranking function to obtain good performance.
Although there are various ranking functions available, most of them are manually designed by IR experts based on heuristics, experience, or observations. Although some of these ranking functions, such as that used in Okapi (see Eq. (2)), are designed based on probabilistic theory, their performance for each individual query is not guaranteed. In other words, even though those theoretically justified ranking functions may work reasonably well on average for a set of queries, they may not work well for each individual query. In fact, various ranking function evaluations and comparative studies (Salton, 1989; Zobel & Moffat, 1998) showed that these ranking functions do not work consistently well across queries. Moreover, it requires a lot of human effort to design a personalized ranking function for each individual query. Finding an optimal ranking function for a particular query or a group of queries remains a design challenge for IR research.

In this paper, we introduce a systematic and automatic discovery framework to aid the ranking function design process. This ranking function discovery is based on an artificial intelligence technique called Genetic Programming, which is widely used in various optimal design and data mining applications (Koza, 1992). We show through various experiments using real textual data that the new ranking function discovery framework is a flexible and powerful discovery tool for optimal ranking function design.

The remainder of this paper is organized as follows. In Section 2 we review related research on ranking function design and evaluation. In Section 3 we give a formal definition of the ranking function discovery problem and describe our ranking function discovery framework based on Genetic Programming. Section 4 discusses several experiments validating our ranking function discovery framework. We discuss related work in Section 5 and conclude the paper with implications of this study and future research directions in Section 6.

2. Prior research on ranking function design and evaluation

IR systems use ranking functions to order documents according to the documents' estimated match with a user query. To facilitate this relevance estimation process, both documents and user queries need to be transformed into a form that can be effectively processed by computers. One of the most successful models is the so-called Vector Space Model (VSM) (Salton, 1971, 1989). The VSM is the underlying model for this study for two reasons:

(1) Ease of interpretation. The VSM is a well-grounded model. It is based on vector spaces and thus can be easily interpreted from a geometric perspective (Jones & Furnas, 1987; Salton, 1989). For example, each document and query vector can be placed in a Euclidean n-dimensional space. The properties of such pairs of vectors, such as their similarity and closeness, can then be studied.

(2) Great success in performance evaluations. The VSM has been one of the most successful models in various performance evaluation studies (Harman, 1993, 1996; Salton, 1971; Salton & Buckley, 1988). Most existing search engines and information retrieval systems are based on this model.

In this model, both documents and user queries are represented as vectors of index terms. Suppose there are a total of t index terms in an entire collection. Then a given document D and query Q can be represented as follows:

\[ D = (w_{d1}, w_{d2}, w_{d3}, \ldots, w_{dt}) \]
\[ Q = (w_{q1}, w_{q2}, w_{q3}, \ldots, w_{qt}) \]

where $w_{di}$, $w_{qi}$ ($i = 1$ to $t$) are term weights assigned to different terms for the document D and query Q, respectively.
The similarity between a query and a document can be calculated by the widely used cosine measure (Salton & Buckley, 1988):

\[ \mathrm{Similarity}(Q, D) = \frac{\sum_{i=1}^{t} w_{qi}\, w_{di}}{\sqrt{\sum_{i=1}^{t} (w_{qi})^2 \, \sum_{i=1}^{t} (w_{di})^2}} \qquad (1) \]

Documents are then ordered by decreasing values of this measure. The art of designing ranking functions depends on the design of term weighting strategies to assign weights for terms in a document, $w_{di}$ ($w_{qi}$ is commonly represented by 1 and thus can be safely ignored). Different term weighting strategies influence the similarity that is computed according to Formula (1). Three well-designed ranking functions are given below as examples, with variables explained in Table 1.
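As an illustration of Formula (1), the following is a minimal Python sketch of the cosine measure over sparse term-weight vectors. The dictionary-based representation and the function names `cosine_similarity` and `rank_documents` are assumptions made for this example and are not part of the original system.

```python
import math


def cosine_similarity(query_weights, doc_weights):
    """Cosine measure of Formula (1) for sparse vectors stored as
    {term: weight} dicts (a hypothetical representation)."""
    # Numerator: sum of w_qi * w_di over terms present in the query.
    dot = sum(w * doc_weights.get(term, 0.0)
              for term, w in query_weights.items())
    # Denominator: product of the two Euclidean norms.
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (q_norm * d_norm)


def rank_documents(query_weights, docs):
    """Order documents by decreasing similarity to the query.
    `docs` maps a document id to its {term: weight} dict."""
    scored = [(doc_id, cosine_similarity(query_weights, weights))
              for doc_id, weights in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With query weights fixed at 1, as the text notes is common, only the document term weights $w_{di}$ change the ordering, which is why the design effort concentrates on the document term weighting strategy.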

Table 1. Notation for ranking functions

tf            Frequency of a term (word) in the document text
QTW           Weighting strategy of a term (word) in the query text. tf is commonly used as the default weighting strategy
N             Total number of documents in the collection
df            Number of documents in the collection in which the term under consideration is present
length        Length of the document (in words)
length_avg    Average document length in the collection (in words)
tf_avg        Average term frequency of all the terms in a document
tf_max        Maximum term frequency of all the terms in a document
H             1.0 if tf_max ≤ …; …/tf_max otherwise

Okapi BM25:

\[ \sum_{T \in Q} \frac{3 \cdot tf}{0.5 + 1.5 \cdot \frac{length}{length_{avg}} + tf} \cdot \log\frac{N - df + 0.5}{df + 0.5} \cdot QTW \qquad (2) \]

Pivoted TFIDF:

\[ \sum_{T \in Q} \frac{1 + \log(tf)}{1 + \log(tf_{avg})} \cdot \frac{1}{0.8 + 0.2 \cdot \frac{length}{length_{avg}}} \cdot \log\frac{N + 1}{df} \cdot QTW \qquad (3) \]

INQUERY:

\[ \sum_{T \in Q} \left( 0.4 + 0.6 \cdot \left( 0.4 \cdot H + 0.6 \cdot \frac{\log(tf + 0.5)}{\log(tf_{max} + 1.0)} \right) \cdot \frac{\log\frac{N}{df}}{\log(N)} \right) \cdot QTW \qquad (4) \]

The expression of these ranking functions is taken from Singhal, Salton, Mitra, and Buckley (1996). Although Okapi BM25 and INQUERY are not designed based on the VSM, they can be represented or approximated using VSM notation (Singhal et al., 1996). It is not difficult to see from the above formulas that their differences lie in their term weighting strategies, that is, in the way they combine the various weighting features shown in Table 1. A change of term weighting strategy essentially changes the behavior of a ranking function. Accordingly, from here on, the phrases term weighting strategy and ranking function will be used interchangeably.

Of course, there are many other ranking functions available (Salton & Buckley, 1988; Salton & McGill, 1983; Singhal et al., 1996; Zobel & Moffat, 1998). Various evaluation studies (Salton & Buckley, 1988; Singhal et al., 1996; Zobel & Moffat, 1998) on a wide range of ranking functions have shown that the performance of ranking functions is very context dependent.
These contexts involve queries, users, and text collections. In other words, a ranking function may be good for certain queries but bad for others. The differences in specification of queries and the vocabulary used to form them, the varied statistical distributions for terms in different collections, and the heterogeneity of end users, all contribute to the performance inconsistency of these ranking functions.
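To make the two baseline term weighting strategies concrete, here is a hedged Python sketch of the Okapi BM25 and pivoted TFIDF weights. It assumes the reading of Formulas (2) and (3) in which Okapi multiplies a length-normalized tf component by log((N - df + 0.5)/(df + 0.5)), and pivoted TFIDF divides (1 + log tf)/(1 + log tf_avg) by 0.8 + 0.2 * length/length_avg and multiplies by log((N + 1)/df). The function names and the per-term statistics dictionary are hypothetical.

```python
import math


def okapi_weight(tf, df, N, length, length_avg, qtw=1.0):
    """Okapi BM25 term weight, assuming the form of Formula (2)."""
    tf_part = 3.0 * tf / (0.5 + 1.5 * length / length_avg + tf)
    idf_part = math.log((N - df + 0.5) / (df + 0.5))
    return tf_part * idf_part * qtw


def pivoted_tfidf_weight(tf, tf_avg, df, N, length, length_avg, qtw=1.0):
    """Pivoted TFIDF term weight, assuming the form of Formula (3)."""
    tf_part = (1.0 + math.log(tf)) / (1.0 + math.log(tf_avg))
    length_norm = 0.8 + 0.2 * length / length_avg
    idf_part = math.log((N + 1.0) / df)
    return tf_part / length_norm * idf_part * qtw


def score_document(query_terms, term_stats, weight_fn):
    """Sum a term weight over the query terms T in Q.

    `term_stats` (hypothetical layout) maps each term that occurs in the
    document to the keyword arguments expected by `weight_fn`. Query terms
    absent from the document contribute nothing.
    """
    return sum(weight_fn(**term_stats[t])
               for t in query_terms if t in term_stats)
```

Both functions consume the same Table 1 features and differ only in how they combine them, which is the sense in which the text treats "term weighting strategy" and "ranking function" as interchangeable.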

These studies bring up the following issues.

a. There is no guarantee that existing ranking functions are the best/optimal ones available. It seems likely that more powerful functions are yet to be discovered. Given the empirical evidence showing the performance inconsistency of existing ranking functions in various experimental evaluations, there is no reason to believe that the ranking functions adopted by existing IR systems are the optimal ones. Previous proofs of optimality clearly do not fully explain behavior in real situations, since assumptions used in these proofs often do not apply. Chances are that some of the better (high-performing) ranking functions are yet to be discovered (Pathak, Gordon, & Fan, 2000).

b. There is no general consensus on which ranking function should be used for which context. Due to the extreme complexity of relevance estimation, the performance of a ranking function is not guaranteed a priori. The same ranking function may work well for some queries but not for others (Salton, 1989; Zobel & Moffat, 1998). In part because there are a large number of ranking functions available, the best way to pick the right ranking function for a given user query is not generally known.

c. The IR field should advance if we understand when to use what ranking function. Recent work led by Harman in TREC aims to explore when and why particular strategies perform well, and where the variance in performance originates. Empirical studies are needed to gain more understanding, which may later lead to better theories of IR.

Clearly, a framework for good ranking function design or discovery is much needed.

3. A ranking function discovery framework based on GP

3.1. Nature of the ranking function discovery problem

The problem of finding a good ranking function is illustrated in Fig. 1. The values 1 and 0 stand for relevant and non-relevant, respectively, in the Rele. column of both document tables.

[Fig. 1. Ranking function discovery problem. Input: a user query plus documents with relevance labels (A 1, B 0, C 0, D 1, E 0, F 1, G 1). Output: a ranking function that orders the documents A, D, F, G (relevant) ahead of B, C, E (non-relevant).]

The problem of finding a ranking function can be formalized as follows: Given as input a user query (or a set of queries) and a set of training documents with known relevance judgments for the query (queries), the discovery framework seeks a ranking function that can potentially rank all relevant documents at the top and non-relevant ones at the bottom.

Because of the characteristics of the problem discussed above, we chose an inductive learning technique called Genetic Programming (GP) (Koza, 1992) as the underlying learning technique to perform the discovery task. Genetic programming, an extension of Genetic Algorithms (GAs), is a problem-solving system designed following the principles of inheritance and evolution. In GP, each potential solution is called an individual in a population. GP works by iteratively applying genetic transformations, such as crossover and mutation, to a population of individuals to create more diverse and better performing individuals in subsequent generations. Refer to Koza (1992) for a detailed introduction to this field. Both GP and GA have been applied earlier to the information retrieval field (Chen, Chung, Ramsey, & Yang, 1998; Fan et al., 2000; Gordon, 1988, 1991; Martin-Bautista, Vila, & Larsen, 1999; Pathak et al., 2000; Raghavan & Agarwal, 1987).

The rationale behind using GP for ranking function discovery is as follows:

No stringent requirement for an objective function. GP can be used with any type of objective function that it seeks to optimize. It does not require that the objective function be continuous, as long as the objective function can differentiate good solutions from bad ones (Koza, 1992). This property allows common discrete performance measures of IR, such as Average Precision, to be used as legitimate objective functions for GP learning.
Ease of representation of ranking functions as solutions and thus adaptable by GP. One of the major modeling advantages of GP over other learning techniques, such as Neural Networks, is that it allows many forms of representation for the solutions (Langdon, 1998), in this case ranking functions. A common representation is a tree data structure. An example of representing one term weighting formula using a tree structure is shown in Fig. 2. The tree-based representation allows for ease of parsing and implementation. A term weighting formula is applied to each term within a document to calculate the document's similarity to the query.

[Fig. 2. A sample term weighting formula in tree and equation forms: tf * log(N/df). The root is *, with children tf and log; log has the child /, whose children are N and df.]

Very useful and effective for nonlinear function and structure discovery. Because of its unique inherent parallel search mechanism, GP has been widely used in many engineering design, scheduling, and structural discovery problems, where many traditional optimization methods cannot apply or do not work very well (Banzhaf, Nordin, Keller, & Francone, 1998; Koza, 1992).

Near globally optimal solution. Even though there is no guarantee that the solution identified by GP is globally optimal, it has been empirically found that the solutions discovered by GP are normally better than the ones designed by human beings or discovered using conventional heuristic algorithms. In many cases, the solutions found are very close to being globally optimal (Koza, 1992).

3.2. The framework for ranking function discovery

Our ranking function discovery framework, based on Genetic Programming, is summarized in Fig. 3. Note that Fig. 3 shows only the discovery framework for a set of queries. This set of queries could be submitted by a single user or by multiple users. This corresponds to the common ad hoc information retrieval task. The discovery framework for a single query (the so-called routing task) is a special case of the framework shown in Fig. 3 and is not given here. More details will be shown later for both single-query and multiple-query ranking function discovery.

As shown in Fig. 3, the entire discovery framework is an iterative process. Starting with a set of training documents with known relevance judgments, GP first operates on a large population of random ranking functions. These ranking functions are then evaluated based on the relevance information from the training documents. If the stopping criterion is not met, the process goes through the genetic transformation steps to create and evaluate the next generation population iteratively.
The validation data set is used to help select ranking functions that generalize well to unseen documents and thus to avoid the effect of over-training.

The following is a detailed description of our implementation of the above framework.

Terminals. In GP, terminals are leaf nodes of a tree data structure (as shown in Fig. 2). Essentially they are weighting features used in term weighting to capture lexical statistics. See Formulas (2) and (3) for examples. In this work, terminals were chosen after examining various term weighting and ranking formulas, as in Salton (1989), Singhal et al. (1996), and Zobel and Moffat (1998). These terminals are listed in Table 2.

Functions. Functions are the operations applied to combine terminals and/or sub-trees to produce other trees. The following functions were used in our implementation: +, ×, /, log.

Initial population generation. A population is a set of individuals that represent document term weighting formulas. (For query term weighting, we use tf, which almost always produces a weight of 1 since most of these query terms appear only once.)
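The tree representation described above can be sketched in a few lines of Python. This is only an illustrative encoding, not the authors' implementation: the nested-tuple layout, the names `FIG2_TREE`, `evaluate`, and `score_document`, and in particular the "protected" division and logarithm (a common GP convention for avoiding runtime errors on arbitrary random trees) are all assumptions of this example.

```python
import math


def protected_div(a, b):
    # Assumption: return 1.0 on division by zero, a common GP convention.
    return a / b if b != 0 else 1.0


def protected_log(x):
    # Assumption: log of the absolute value, and 0.0 for log(0).
    return math.log(abs(x)) if x != 0 else 0.0


# The function set named in the text: +, x, /, log.
FUNCTIONS = {
    "+": lambda a, b: a + b,
    "*": lambda a, b: a * b,
    "/": protected_div,
    "log": protected_log,
}

# Fig. 2 as a nested tuple: (function, child, ...); leaves are terminal
# names (strings) or real constants.
FIG2_TREE = ("*", "tf", ("log", ("/", "N", "df")))


def evaluate(tree, features):
    """Evaluate a term weighting tree for one term.
    `features` maps terminal names (tf, df, N, ...) to their values."""
    if isinstance(tree, tuple):
        op, *children = tree
        return FUNCTIONS[op](*(evaluate(c, features) for c in children))
    if isinstance(tree, str):
        return features[tree]
    return float(tree)  # a real constant terminal


def score_document(tree, query_terms, doc_features):
    """Apply the tree to every query term found in the document and sum.
    `doc_features` maps term -> feature dict (hypothetical layout)."""
    return sum(evaluate(tree, doc_features[t])
               for t in query_terms if t in doc_features)
```

Because every individual is just a tree over the same terminals and functions, any candidate ranking function can be scored with the same `evaluate` routine, which is what makes the representation convenient for GP.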

[Fig. 3. The ranking function discovery framework.]

Table 2. The terminals used in our GP system

Terminal      Statistical meaning
tf            Number of times a term appeared in a document
tf_max        Maximum tf for a document
tf_avg        Average tf for a document
tf_doc_max    Maximum tf in the entire document collection
df            Number of unique documents in which a term appeared
df_max        Maximum df for all terms in a given query
N             Total number of documents in the entire text collection
length        Length of a document
length_avg    Average length of a document in the entire collection
R             Real constant number randomly generated by the GP system
n             Number of unique terms in a document

Table 3. Performance measures and their definitions

Measure     Definition
P_avg       The average of the precision scores calculated every time a new relevant document is found, normalized by the total number of relevant documents in a collection
R_P         The precision when T_Rel (total number of relevant documents in a collection) documents are retrieved
T_Recall    The recall of the top 1000 documents retrieved

The initial set of trees, constrained to have a maximum depth of four levels, was generated by the ramped half-and-half method (Koza, 1992). This method stipulates that half of the randomly generated trees must be generated by a random process which ensures that all branches reach the maximum initial depth. The remaining randomly generated trees have branches whose lengths do not exceed this depth. These constraints have been found to generate a good initial sample of trees (Koza, 1992).

Fitness functions. A fitness function measures how effective a ranking function represented by an individual tree is for ranking. In our implementation, Average Precision (P_avg, defined in Table 3) in the top DCV documents (see footnote 2) is used as the fitness function. For multiple queries, the fitness is the average of the P_avg values over all queries.

The GP operators. Following the suggestion of Koza (1992), we use only Reproduction and Crossover in our implementation.

Reproduction. Reproduction copies the top rate_r × P trees in the current generation to the next generation directly, without any genetic transformation. The reproduction rate, rate_r, is generally 0.1 or less, and P is the population size.

Crossover. Crossover ensures variety by creating trees that differ from their parents. For crossover, a method called tournament selection (Koza, 1992) is used. Tournament selection works by first selecting, with replacement, k (we use 6) trees at random from the current generation.
The two trees with the highest fitness are paired and exchange sub-trees.

Stopping criterion. We stop the GP discovery process after 30 generations. First, the simulation is highly computationally intensive. Second, our pilot experiments with sample queries indicated that 30 generations was a sufficient period to generate high-performing trees.

Footnote 2. DCV, standing for Document Cutoff Value, means the total number of results returned by the search engine. It is an arbitrary number. We set it at 1000 throughout the study (Harman, 1996).

4. Experiments

To test the ranking function discovery framework, we used the Associated Press (AP) news collection from the TREC conference (Harman, 1996) as our textual data. This news collection contains more than 240,000 news articles from 1988 to 1990 and covers a variety of domains and topics. It has been used widely in the IR field to test new retrieval algorithms. More specifically, we use AP88 (79,919 documents) as the training data, AP89 (84,678 documents) as the validation data, and AP90 (78,321 documents) as the test data. The reason for the three-part data set design in our experiments is to overcome the potential overfitting problem (Fan et al., 2000), in which the model trained on the training data does not generalize very well to new, unseen data. Because the validation data set is not used during the GP training process, its use after training helps select better trees that can generalize well to unseen documents. Thus the three-part data set design is expected to help alleviate or reduce the effect of overfitting for GP. This is also the strategy commonly used in machine learning experiments (Mitchell, 1997).

For test queries, we used the 50 topics provided in TREC 7 (Voorhees & Harman, 1998). All the documents in the AP collection have been judged for relevance for each of these topics. Two different versions of the topic description were indexed. In one version, only the title + description portion of the topics is indexed. These queries are called short queries. The number of terms contained in the short queries varies from 4 to 18. In the other version, all the components of the topics (title, description, narrative, and concepts) are indexed. Correspondingly, these queries are called long queries. The number of terms for the long queries varies from 17 to 96. These two versions of queries are created for the following reasons: (a) most user-provided queries are very short (Jansen, Spink, & Saracevic, 2000); and (b) we seek to identify whether there is any query effect due to query length differences.
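Before turning to the measures and results, the GP operators of Section 3.2 (reproduction of the top rate_r × P trees, and crossover between the two fittest of k = 6 tournament contestants) can be sketched as one generation step in Python. This is a minimal illustration under stated assumptions, not the authors' code: trees are assumed to be nested tuples of the form (function, child, ...), the crossover points are chosen uniformly over all nodes, and the names `subtrees`, `replace_at`, `subtree_crossover`, `tournament_pair`, and `next_generation` are hypothetical.

```python
import random


def subtrees(tree, path=()):
    """Yield (path, subtree) for every node of a nested-tuple tree."""
    yield path, tree
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtrees(child, path + (i,))


def replace_at(tree, path, new_subtree):
    """Return a copy of `tree` with the node at `path` replaced."""
    if not path:
        return new_subtree
    i = path[0]
    return tree[:i] + (replace_at(tree[i], path[1:], new_subtree),) + tree[i + 1:]


def subtree_crossover(a, b, rng=random):
    """Exchange one randomly chosen sub-tree between two parents."""
    path_a, sub_a = rng.choice(list(subtrees(a)))
    path_b, sub_b = rng.choice(list(subtrees(b)))
    return replace_at(a, path_a, sub_b), replace_at(b, path_b, sub_a)


def tournament_pair(population, fitness, k=6, rng=random):
    """Draw k trees with replacement and return the two fittest."""
    contestants = [rng.choice(population) for _ in range(k)]
    contestants.sort(key=fitness, reverse=True)
    return contestants[0], contestants[1]


def next_generation(population, fitness, crossover,
                    rate_r=0.1, k=6, rng=random):
    """Build the next generation: copy the top rate_r * P trees unchanged,
    then fill the remaining slots with crossover offspring."""
    size = len(population)
    ranked = sorted(population, key=fitness, reverse=True)
    new_population = ranked[:round(rate_r * size)]      # reproduction
    while len(new_population) < size:                    # crossover
        parent_a, parent_b = tournament_pair(population, fitness, k, rng)
        new_population.extend(crossover(parent_a, parent_b, rng))
    return new_population[:size]
```

In a full run, `fitness` would be the average P_avg of a tree over the training queries, and `next_generation` would be called once per generation until the 30-generation stopping criterion is reached; in practice the fitness values would be cached, since each evaluation requires ranking the training documents.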
We use the performance measures defined in Table 3. Among the three measures, P_avg is the primary one used by TREC for cross-system comparisons.

We designed two experiments to test our ranking function discovery framework. The first experiment tested the framework at the individual query level, with the aim of personalized ranking. The second experiment tested the framework on a group of queries for the purpose of consensus ranking or generic ranking function optimization. For both experiments, we use the Okapi and PTFIDF ranking functions shown in Formulas (2) and (3) as the baselines for comparison purposes. We believe the results from these two experiments provide enough insight into the efficacy of the ranking function discovery framework for discovering new ranking functions in different contexts. Both experiments used the same settings for the GP system (shown in Table 4). We describe each of the two experiments next.

Table 4. GP system experimental settings

Population size      200
Crossover rate       0.9
Reproduction rate    0.1
Generations          30

4.1. Ranking function discovery for information routing

In this experiment, we apply the ranking function discovery framework to each individual query to find the optimal ranking function for it, and then use the discovered ranking function for later incoming documents. This corresponds to the information routing task. Because each training query, along with its relevance judgments, is submitted by one assessor, we also call this task a personalized search task (Pitkow et al., 2002): we believe the preference of the assessor has already been embedded in the query relevance judgment process.

Among the 50 queries we constructed from the 50 topics, 15 have few (0 to 4) relevant documents in the training, validation, or test data. Such a small number of relevant documents is unlikely to help the performance calculation and the discovery process. Thus these 15 queries were excluded from our experiments, which left us 35 queries for the first experiment.

Table 5 summarizes the aggregate performance comparison between Personalized ranking functions (RFs) discovered by GP and two baseline ranking functions (PTFIDF and Okapi) on three measures: P_avg, R_P, and T_Recall. For PTFIDF and Okapi, we report the absolute performance results only. For the Personalized RFs, we additionally report the performance gain over each of the two baseline systems, PTFIDF and Okapi.

As can be seen from Table 5, Okapi in general performed better than PTFIDF. Personalized ranking functions discovered by GP performed the best among the three approaches. Overall, the Personalized RFs for individual queries discovered by GP outperformed PTFIDF on P_avg by more than 16% for short queries and more than 32% for long queries. Similarly, Personalized RFs outperformed Okapi on P_avg by 10.71% for short queries and by more than 17% for long queries.

In terms of R_P (which measures the precision in the top T_Rel documents), ranking functions discovered by GP again showed advantages over the two baseline approaches. The gain by Personalized RFs over PTFIDF for short and long queries is 16.79% and 31.98%, respectively. The gain by Personalized RFs over Okapi for short and long queries is 7.03% and 19.36%, respectively.

In terms of total recall (T_Recall), there is essentially no difference between Personalized RFs and Okapi. On the other hand, Personalized RFs performed better than PTFIDF by 4.29% for short queries and 8.43% for long queries. Note that T_Recall is probably not as meaningful as the other measures since people seldom retrieve 1000 documents.
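The three measures of Table 3 and the percentage gains discussed above can be sketched as follows. This is a minimal Python illustration: the input is assumed to be a ranked list of 0/1 relevance labels (as in Fig. 1), the function names are hypothetical, and `relative_gain` assumes that the reported gains are simple percentage improvements over the baseline value, which the text does not state explicitly.

```python
def average_precision(ranked_relevance, total_relevant, cutoff=1000):
    """P_avg: mean of the precision values observed at each relevant
    document in the top `cutoff` results, normalized by the total
    number of relevant documents in the collection."""
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance[:cutoff], start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant


def r_precision(ranked_relevance, total_relevant):
    """R_P: precision after T_Rel documents have been retrieved."""
    if total_relevant == 0:
        return 0.0
    return sum(ranked_relevance[:total_relevant]) / total_relevant


def total_recall(ranked_relevance, total_relevant, cutoff=1000):
    """T_Recall: recall within the top `cutoff` documents."""
    if total_relevant == 0:
        return 0.0
    return sum(ranked_relevance[:cutoff]) / total_relevant


def relative_gain(new_value, baseline_value):
    """Percentage gain over a baseline (assumed definition)."""
    return 100.0 * (new_value - baseline_value) / baseline_value
```

For the Fig. 1 example, the input order A to G has labels [1, 0, 0, 1, 0, 1, 1], while the ideal output order A, D, F, G, B, C, E has labels [1, 1, 1, 1, 0, 0, 0]; the ideal ordering scores 1.0 on P_avg and R_P, so a discovered ranking function is rewarded for moving toward it.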
A subsequent repeated measure MANOVA test on the three performance measures P_avg, R_P, and T_Recall showed that the choice of RF has a significant impact on search performance (F = 4.969, p = 0.00). We followed up with a multiple comparison with Bonferroni adjustment on all three measures to see whether the results obtained by the Personalized RFs are significantly better than those of the baseline systems. The results were statistically significant, with Personalized RFs better than the two baselines on both P_avg and R_P (both ps < 0.01). On T_Recall, there is a significant difference between Personalized RFs and PTFIDF, but not between Personalized RFs and Okapi.

Table 5. Performance results of individual/personalized ranking function (RF) discovery

Query   Approach                   P_avg      R_P        T_Recall
Short   PTFIDF                     …          …          …
        Okapi                      …          …          …
        Personalized RFs by GP     …          …          …
          vs. PTFIDF               …%         +16.79%    +4.29%
          vs. Okapi                +10.71%    +7.03%     -0.13%
Long    PTFIDF                     …          …          …
        Okapi                      …          …          …
        Personalized RFs by GP     …          …          …
          vs. PTFIDF               …%         +31.98%    +8.43%
          vs. Okapi                …%         +19.36%    -1.31%

Another interesting observation from Table 5 is that all three approaches performed much better with long queries than with short queries. This indicates the importance of better query representation. Longer queries tend to capture users' information needs better than short ones. With the query representation changed from short to long, the ranking function discovery framework by GP gained the most among the three approaches compared.

4.2. Ranking function discovery for ad hoc information retrieval

In this experiment, we apply the ranking function discovery framework to all 50 queries to find the optimal consensus ranking function for all the queries. This corresponds to the ad hoc information retrieval task. Note that we include the 15 queries excluded in experiment 1. Having few relevant documents for a particular query does not have much impact on the evaluation of a consensus ranking function, because the fitness function for each ranking tree is now the average of the P_avg values over all queries. The performance results reported in Table 6 for PTFIDF and Okapi differ from those reported in Table 5 due to the addition of these 15 queries.

Table 6 summarizes the aggregate performance comparison between the consensus RF discovered by GP, PTFIDF, and Okapi on three measures: P_avg, R_P, and T_Recall. For PTFIDF and Okapi, we report the absolute performance results only. For the consensus RF, we additionally report the performance gain over each of the two baseline systems, PTFIDF and Okapi.
As can be seen from Table 6, the consensus ranking function discovered by GP for 50 short queries outperformed both Okapi and PTFIDF on all three performance measures. The discovered consensus ranking function gained the most in R_P: more than 52% over PTFIDF and more than 37% over Okapi. This result indicates that the new consensus ranking function ranks more relevant documents at the top than both Okapi and PTFIDF, a property sought in any search engine performance optimization. The consensus ranking function discovered by GP for 50 long queries outperformed PTFIDF on all three performance measures, and outperformed Okapi on both P_avg and R_P. Its performance on T_Recall is very close to that of Okapi.

Table 6. Performance results of consensus ranking function discovery for multiple queries

Query   Approach               P_avg     R_P    T_Recall
Short   PTFIDF                 …         …      …
        Okapi                  …         …      …
        Consensus RF by GP     …         …      …
          vs. PTFIDF           …%        …%     +6.82%
          vs. Okapi            +3.13%    …%     +2.94%
Long    PTFIDF                 …         …      …
        Okapi                  …         …      …
        Consensus RF by GP     …         …      …
          vs. PTFIDF           …%        …%     +5.60%
          vs. Okapi            +5.71%    …%     -0.28%

A subsequent repeated measure MANOVA test on the three performance measures P_avg, R_P, and T_Recall showed that the choice of RF again has a statistically significant effect on ranking performance. A follow-up multiple comparison with Bonferroni adjustment showed a significant difference between the consensus ranking function discovered by GP and PTFIDF on all three measures (all ps < 0.01). A significant difference was detected between the consensus RF discovered by GP and Okapi on R_P, but not on P_avg and T_Recall. The R_P result indicates that the newly discovered ranking function is very effective at returning relevant documents at the top of a hit list on average, which is a sought-after feature in ad hoc information retrieval.

Overall, GP has a better performance gain over Okapi in P_avg for long queries (5.71%) than for short queries (3.13%). But these performance gains are much smaller than those in the routing task (more than 10%), which indicates that it is more challenging to discover a better ranking function for multiple queries than for individual queries.

To give readers an idea of the ranking functions we discovered using GP, we list the best consensus ranking function discovered by GP for 50 short queries in Fig. 4. The ranking tree shown in Fig. 4 can be expressed in the following mathematical form (the fraction layout of the formula did not survive in this copy; the symbols are given in their extracted order):

tf tf N log tf tf_avg + + log(tf 2 tf_avg) df n + 2 tf_doc_max + 0.373 tf_avg (tf_doc_max + n) + df    (5)

It is interesting to note that GP can discover some of the well-known ranking strategies, like TFIDF (tf · N/df). Moreover, the denominator in Formula (5) shows that alternative normalization factors may exist besides the traditional document length normalization used in Okapi and PTFIDF.
Fig. 4. The best consensus ranking function discovered by GP for 50 short queries.

To further verify the generalizability of the newly discovered ranking function, we tested it along with the two baseline ranking functions, Okapi and PTFIDF, on the 10 GB web corpus used in TREC 10 for 50 short queries. The results are summarized in Table 7. It is not difficult to see from the results in Table 7 that the newly discovered ranking function shown in Formula (5) performed very well in the web search context. It maintained the performance gain over both Okapi and PTFIDF as in the AP news corpus.
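The "vs. PTFIDF" and "vs. Okapi" rows in the result tables read as percentage changes relative to a baseline. The sketch below shows that arithmetic under the assumption that each gain is (new - baseline) / baseline; the function names and the sample scores are hypothetical and are not values from the tables.

```python
def relative_gain(new_score, baseline_score):
    """Percentage change of new_score relative to baseline_score."""
    if baseline_score == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * (new_score - baseline_score) / baseline_score


def format_gain(gain):
    """Render a gain with an explicit sign, in the style used by the tables."""
    return f"{gain:+.2f}%"


if __name__ == "__main__":
    # Hypothetical scores for illustration only.
    baseline, candidate = 0.50, 0.55
    print(format_gain(relative_gain(candidate, baseline)))
```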

Table 7
Performance results of three RFs on 10 GB web corpus

Query   Approach              P_avg     R_P       T_Recall
Short   PTFIDF                ...       ...       ...
        Okapi                 ...       ...       ...
        Consensus RF by GP    ...       ...       ...
          vs. PTFIDF          ...%      ...%      ...%
          vs. Okapi           +3.34%    +6.83%    +3.13%

5. Related work

There have been several efforts on ranking function optimization in the IR literature. The earliest work is by Fox (1983), who used polynomial regression to optimize the ranking function. Fuhr et al. (Fuhr & Buckley, 1991; Fuhr & Pfeifer, 1994) used probabilistic models as machine learning approaches. The concept of relevance description used in Fuhr and Buckley (1991) and Fuhr and Pfeifer (1994) is very similar to the weighting evidence (tf, df, ...) we used for ranking. Our work differs from theirs in that we use a ranking function of arbitrary numerical functional form designed by GP, whereas their ranking function (called a retrieval function in those papers) is either a polynomial regression function (Fuhr & Buckley, 1991) or a logistic regression/loglinear function (Fuhr & Pfeifer, 1994). Similar ideas using logistic regression for ranking function design and optimization have also been explored in Gey (1994).

Another line of research on ranking function optimization follows the combination of experts approach, in which a set of ranking functions is brought together either numerically through linear combination (Bartell, Cottrell, & Belew, 1994; Fox, Koushik, Shaw, Modlin, & Rao, 1993; Pathak et al., 2000; Vogt & Cottrell, 1999) or by simple majority vote (Fox et al., 1993; Lee, 1997). The effectiveness of this approach is limited by the number of experts (ranking functions) used and by how effective they are individually. Our work, in contrast, can produce new ranking functions with better performance than existing ones.
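As an illustration of the linear combination variant of the combination of experts approach, the sketch below merges the scores of several ranking functions into one ranked list. The min-max normalization step, the fixed weights and all names are assumptions made for this example; the cited works differ in how they normalize scores and learn the weights.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to [0, 1]; constant scores map to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def combine_linear(expert_scores, weights):
    """Weighted sum of normalized scores from several experts (ranking functions).

    A document missing from one expert's list contributes 0 for that expert.
    """
    if len(expert_scores) != len(weights):
        raise ValueError("need exactly one weight per expert")
    combined = {}
    for scores, weight in zip(expert_scores, weights):
        for doc, s in min_max_normalize(scores).items():
            combined[doc] = combined.get(doc, 0.0) + weight * s
    return combined


def rank(combined):
    """Document ids sorted by descending combined score, ties broken by id."""
    return sorted(combined, key=lambda doc: (-combined[doc], doc))


if __name__ == "__main__":
    # Hypothetical scores from two ranking functions for three documents.
    expert_a = {"d1": 2.0, "d2": 1.0, "d3": 0.0}
    expert_b = {"d1": 0.0, "d2": 10.0, "d3": 5.0}
    print(rank(combine_linear([expert_a, expert_b], [0.5, 0.5])))
```

A ranking function discovered by GP could simply be added as one more entry in `expert_scores`.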
These newly discovered ranking functions can be combined with other well-known ranking functions using the combination of experts approach to further improve ranking performance.

6. Conclusions

In this paper, by effectively leveraging the clues of different weighting features used by many IR experts, we demonstrated that a machine intelligence tool like GP can help us automate the discovery of better ranking functions for a variety of contexts, a task that would otherwise be very tedious and difficult for any human being. More specifically, the new ranking function discovery framework based on GP can be used to effectively discover either personalized ranking functions for each individual query or a consensus ranking function for a group of queries. The comparison results with two well-known baselines demonstrated the advantage of this approach. The proposed framework is well suited for both information routing and ad hoc information retrieval.

We believe the ranking function discovery framework can be used as an effective personalization tool for user preference modeling. A user's preference model, once discovered, can be used for future information discovery and filtering. This new discovery framework can also be used by search engine vendors to fine-tune their search engines' performance. Many queries submitted to search engines are repetitive over time. Optimizing the ranking functions used for these queries will help improve the success rate and user satisfaction on these queries.

In the future, we plan to test the discovery framework in the web search context. Currently, we have only tested the framework using the news corpus from Associated Press. Web documents are more heterogeneous than the news corpus. They also contain various tags, such as <Anchor>, <Title>, and <Meta>, that may potentially help improve a search engine's ranking performance. We plan to extend our framework to include structural and semantic information and to test it on the web.

We also would like to apply GP to other text mining tasks such as text classification, text summarization, and question answering. All these tasks require effective ranking of texts of different granularities: phrases, sentences, or documents. In all these cases, GP can be used as a machine learning tool to help discover the optimal way to rank these text units.

References

Banzhaf, W., Nordin, P., Keller, R. E., & Francone, F. D. (1998). Genetic programming: an introduction on the automatic evolution of computer programs and its applications. San Francisco, CA: Morgan Kaufmann Publishers.
Bartell, B. T., Cottrell, G. W., & Belew, R. K. (1994). Automatic combination of multiple ranked retrieval systems. In The proceedings of seventeenth annual international ACM SIGIR conference on research and development in information retrieval (pp ). Available: citeseer.nj.nec.com/bartell94automatic.html.
Chen, H., Chung, Y., Ramsey, M., & Yang, C. (1998). A smart itsy bitsy spider for the web. Journal of the American Society for Information Science, 49(7),
Fan, W., Gordon, M. D., & Pathak, P. (2000). Personalization of search engine services for effective retrieval and knowledge management. In Proceedings of 2000 international conference on information systems (ICIS), Brisbane, Australia (pp ).
Fox, E. A. (1983). Extending the boolean and vector space models of information retrieval with p-norm queries and multiple concept types. Ph.D. thesis, Cornell University.
Fox, E. A., Koushik, M. P., Shaw, J., Modlin, R., & Rao, D. (1993). Combining evidence from multiple searches. In Proceedings of the first text retrieval conference (TREC-1). NIST Special Publication (pp ).
Fuhr, N., & Buckley, C. (1991). A probabilistic learning approach for document indexing. ACM Transactions on Information Systems, 9(3), Available: citeseer.nj.nec.com/fuhr91probabilistic.html.
Fuhr, N., & Pfeifer, U. (1994). Probabilistic information retrieval as combination of abstraction, inductive learning and probabilistic assumptions. ACM Transactions on Information Systems, 12(1), Available: citeseer.nj.nec.com/fuhr94probabilistic.html.
Gey, F. C. (1994). Inferring probability of relevance using the method of logistic regression. In The proceedings of seventeenth annual international ACM SIGIR conference on research and development in information retrieval (pp ).
Gordon, M. (1988). Probabilistic and genetic algorithms for document retrieval. Communications of ACM, 31(2),
Gordon, M. (1991). User-based document clustering by redescribing subject descriptions with a genetic algorithm. Journal of the American Society for Information Science, 42(5),
Gordon, M., & Pathak, P. (1999). Finding information on the World Wide Web: the retrieval effectiveness of search engines. Information Processing and Management, 35(2),

Harman, D. K. (1993). Overview of the first text retrieval conference (TREC-1). In D. K. Harman (Ed.), Proceedings of the first text retrieval conference. NIST Special Publication (pp. 1–20).
Harman, D. K. (1996). Overview of the fourth text retrieval conference (TREC-4). In D. K. Harman (Ed.), Proceedings of the fourth text retrieval conference. NIST Special Publication (pp. 1–24).
Jansen, B. J., Spink, A., & Saracevic, T. (2000). Real life, real users, and real needs: a study and analysis of user queries on the web. Information Processing and Management, 36(2),
Jones, W. P., & Furnas, G. W. (1987). Pictures of relevance: a geometric analysis of similarity measures. Journal of the American Society for Information Science, 38(6),
Koza, J. R. (1992). Genetic programming: on the programming of computers by means of natural selection. Cambridge, MA, USA: MIT Press.
Langdon, W. B. (1998). Data structures and genetic programming: genetic programming + data structures = automatic programming. Kluwer Publishing.
Lee, J. H. (1997). Analyses of multiple evidence combination. In The proceedings of twentieth annual international ACM SIGIR conference on research and development in information retrieval (pp ).
Martin-Bautista, M. J., Vila, M., & Larsen, H. L. (1999). A fuzzy genetic algorithm approach to an adaptive information retrieval agent. Journal of the American Society for Information Science, 50(9),
Mitchell, T. M. (1997). Machine learning. New York, NY: McGraw Hill.
Pathak, P., Gordon, M., & Fan, W. (2000). Effective information retrieval using genetic algorithms based matching function adaptation. In Proceedings of the 33rd Hawaii international conference on system science (HICSS), Hawaii, USA.
Pitkow, J., Schutze, H., Cass, T., Cooley, R., Turnbull, D., Edmonds, A., Adar, E., & Breuel, T. (2002). Personalized search. Communications of the ACM, 45(9),
Raghavan, V. V., & Agarwal, B. (1987). Optimal determination of user-oriented clusters: an application for the reproductive plan. In Proceedings of the second international conference on genetic algorithms and their applications, Cambridge, MA (pp ).
Salton, G. (1971). The SMART retrieval system: experiments in automatic document processing. New Jersey: Prentice Hall.
Salton, G. (1989). Automatic text processing. Reading, MA: Addison-Wesley Publishing Co.
Salton, G., & Buckley, C. (1988). Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5),
Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. McGraw-Hill.
Singhal, A., Salton, G., Mitra, M., & Buckley, C. (1996). Document length normalization. Information Processing and Management, 32(5),
Vogt, C. C., & Cottrell, G. W. (1999). Fusion via a linear combination of scores. Information Retrieval, 1(3),
Voorhees, E. M., & Harman, D. K. (1998). Overview of the seventh text retrieval conference (TREC-7). In E. M. Voorhees & D. K. Harman (Eds.), Proceedings of the seventh text retrieval conference. NIST Special Publication (pp. 1–24).
Zobel, J., & Moffat, A. (1998). Exploring the similarity space. SIGIR Forum, 32(1),
