Learning Collection Fusion Strategies for Information Retrieval


Appears in Proceedings of the Twelfth Annual Machine Learning Conference, Lake Tahoe, July 1995

Learning Collection Fusion Strategies for Information Retrieval

Geoffrey Towell, Ellen M. Voorhees, Narendra K. Gupta, Ben Johnson-Laird
Siemens Corporate Research, 755 College Road East, Princeton, NJ

Abstract

In this paper we describe an Information Retrieval problem called collection fusion. The collection fusion problem is to maximize the number of relevant natural language documents retrieved given: a natural language query, multiple collections of documents, and a fixed total number of documents to retrieve. We describe two algorithms that use past queries to learn collection fusion strategies. Tests of these algorithms on a corpus of 742,000 documents indicate that they can learn good fusion strategies. Moreover, the strategies learned by our methods are consistently superior to those learned by a standard learning algorithm.

1 INTRODUCTION

The goal of an information retrieval (IR) system is to find those documents that are relevant to a natural language query from within a collection of natural language documents. Many IR systems use statistical methods to assign a match score to each document in the collection rather than attempting to understand the text (Salton & McGill, 1983). Scores depend upon factors including, but not limited to: the words in the query, the words in the document, the word frequencies in the document collection, and the particular scoring function. Changes to any of these items affect scores. Typically, IR systems have assumed that the documents to be selected among are stored in a single, monolithic collection. Hence, the documents retrieved for a given query are those with the best scores. However, there are cases in which it is natural for documents to be stored in several collections, more than one of which may contain documents of interest. For example, you might have collections of sports and business documents.
A query about highly paid people might find relevant documents in both collections. When documents are stored in multiple collections, the IR problem is more complex because scores may not be compatible across collections. At the very least, the word frequencies in the document collections will vary. Hence, for a given query, identical documents in two collections can be expected to receive different scores. As a result, scores assigned to documents cannot be used in a straightforward manner to create a total ordering. Without a total ordering (or an alternative such as the optimal fusion; see Section 2.4), it is difficult to determine the number of documents to retrieve from each collection. The problem of determining the number of documents to be retrieved from each collection in an environment with multiple document collections is collection fusion. Our goal is to develop collection fusion strategies and to investigate how effectively these strategies can learn from past queries. In general, the precision of our fusion methods is within 10% of the precision obtained when the collections are treated as a single collection. (Precision is a common measure in IR; it is the number of relevant documents retrieved divided by the total number of documents retrieved.) In addition, our fusion methods are more effective than one based upon a standard learning approach. Details of our experiments are reported in Section 4. The next section provides the underpinnings of this report. It contains a formal definition of the collection fusion problem and describes two simple, but ineffective, fusion strategies. This section also briefly describes the IR system used in all of our work, the targets to which our strategies will aspire, and the data set upon which our tests are executed. The succeeding section describes our two fusion methods (Voorhees et al., 1994; Voorhees et al., 1995) and a method that uses neural networks.
The final section analyses the strengths and weaknesses of these methods and discusses areas in need of further work.

2 UNDERPINNINGS

2.1 FORMAL DEFINITION OF COLLECTION FUSION

Consider a set of document collections, each of which is accessed through its own information server. For a given query Q, each collection I has a certain number of relevant documents. We assume that when the server for I is presented with Q, it returns a list of documents sorted by decreasing similarity to the query. We denote the distribution of relevant documents in the retrieved set (i.e., the ranks at which the relevant documents occur) by F_Q^I(S), a function that maps the number of retrieved documents, S, into a vector whose length is equal to the number of relevant documents in the S documents retrieved (see, for example, the first table in Figure 1). The collection fusion problem can now be formally stated as follows:

Given: 1) Q, a query; 2) I_1, I_2, ..., I_C, information servers; 3) N, the total number of documents to be retrieved;

Find: values of λ_1, λ_2, ..., λ_C such that Σ_{i=1}^{C} λ_i = N and Σ_{i=1}^{C} |F_Q^{I_i}(λ_i)| is maximized.

That is, the goal is to maximize the total number of relevant documents retrieved. In practice, F_Q^I is not known and so must be approximated.

2.2 THE TREC DOCUMENT COLLECTION

All of the retrieval runs use the approximately 742,000 documents and the 200 queries and relevance assessments in the TREC collection (Harman, 1993). This collection is a standard in the IR community. It consists of five subcollections that are of different sizes, cover diverse topics, have different retrieval characteristics, and come from different sources. The sources of the documents are: the A.P. newswire (referred to hereafter as AP), U.S. Department of Energy publications (DOE), the Federal Register (FR), the Wall Street Journal (WSJ), and Ziff-Davis Publishing Computer Selects (ZIFF). We could not use all 200 queries in the TREC collection because a large majority of the queries preferentially select AP.
As a result, each of the methods described in Section 3 learns after seeing very few queries that the best strategy is to retrieve documents from AP. Hence, the set of all 200 queries does not adequately test the efficacy of the algorithms. So, we selected a 99-member subset of the queries such that the three largest sub-collections, AP, WSJ, and ZIFF, are each preferentially selected by one-third of the queries when a large number of documents are retrieved. Specifically, for each of these three collections we selected the 33 queries that maximized the ratio

(number of relevant documents in the collection) / (number of relevant documents in AP, WSJ, and ZIFF combined).

This measure is independent of the underlying IR system; it is the only result in this paper for which this is the case. For all but 3 queries selected by this metric, the number of relevant documents in the preferred collection is greater than the total number of relevant documents in the other collections. (The DOE and FR collections are preferentially selected by very few queries, so it is not possible to select a set of queries balanced across all five document sources. Hence, we did not consider them in our selection procedure. Their inclusion would have changed 7 of the 99 queries in our test collection.) Although we chose queries for their preferred collection, when a small number of documents are retrieved (using the IR system described next) these queries have a bias towards AP. For example, when 10 documents are retrieved from each collection, AP has the largest number of relevant documents 51 times. Instructions for obtaining the TREC collection, and a list of the 99 queries we used, are included as an appendix.

2.3 UNDERLYING IR SYSTEM

Throughout the remainder of this paper, natural language is not used directly. Rather, we use the vector space model of IR (Salton et al., 1975). In this model, documents and queries are translated into vectors using the following four-step process: 1.
remove suffixes (e.g., remove "ing" from "sleeping"); 2. remove stop words (e.g., "the", "of"); 3. fold the remaining words into a vector in which each position corresponds to a stemmed word; 4. fill the vector with word frequencies. We use the SMART system (Buckley, 1985) as the IR system underlying all of our tests. That is, SMART is used to create the document and query vectors and to score the relationship between documents and queries. In SMART, we used: document vectors as described; query vectors as described, except that the frequency of each word was divided by the number of documents in which it occurred in the given collection; and the cosine similarity measure. This combination has proven effective (Salton & Buckley, 1988).
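As a concrete illustration, the four-step vectorization and the cosine measure can be sketched as follows. The tiny stop-word list and the crude suffix stripper are illustrative stand-ins for this sketch, not the actual SMART implementation (SMART uses a full stemmer and stop list):

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "of", "a", "is", "to", "and"}  # tiny illustrative list

def stem(word):
    # Crude suffix stripping standing in for a real stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def to_vector(text):
    """Steps 1-4: strip suffixes, drop stop words, fold stems into a
    sparse vector of raw term frequencies."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(w) for w in words if w not in STOP_WORDS)

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0
```

For example, `to_vector("sleeping dogs")` and `to_vector("the dog sleeps")` both reduce to the stems {sleep, dog}, so their cosine similarity is 1.0.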

For three collections, labeled A, B, and C, the positions of the relevant documents in the 10 most highly ranked documents for a given query, i.e., F_Q^I(10), are given in the first table of Figure 1. The second table gives the number of documents to retrieve from each collection, and the number of relevant documents thereby retrieved, for each total number of documents to retrieve.

[Figure 1: Optimal Fusion. The figure's two tables, the relevant document distributions for collections A, B, and C and the resulting per-collection retrieval counts, were lost in extraction.]

2.4 TARGETS

We compare our learned fusion strategies to two targets: single collection precision and the optimal fusion. Single collection precision approximates the ultimate ability of the retrieval system. It assumes that the user can merge all subcollections into a single, monolithic collection. Optimal fusion is a retrospective technique that gives an upper bound on the effectiveness of any fusion strategy. It uses the actual distribution of the relevant documents (F_Q^I above) to maximize the number of relevant documents retrieved. Figure 1 illustrates this procedure. Figure 2 shows the effectiveness of these two targets as the number of documents to be retrieved is varied. For both targets, as the number of documents to be retrieved increases, the precision decreases. This is as expected: there are a limited number of relevant documents. More interesting is that the precision of the optimal fusion is consistently superior to that of the single collection. This result is encouraging: fusion systems that are less than optimal may still be as effective as a single collection.

2.5 SIMPLE COLLECTION FUSION STRATEGIES

Perhaps the simplest fusion strategy, which we will call uniform, is based on the assumption that every collection has the same distribution of relevant documents for every possible query (e.g., if this were true, then in Figure 1 the rows in the first table would always be identical).
Under this assumption, retrieving an equal number of documents from each collection will, on average, maximize the number of relevant documents retrieved. In practice, this assumption is not valid.

[Figure 2: The precision of two simple fusion strategies (uniform and merge-sort) and two targets (optimal fusion and single collection) as the number of documents to be retrieved varies.]

Different collections have different specialties and thus do not have equal numbers of relevant documents. Given different numbers of relevant documents, it is difficult to imagine that relevant documents will be distributed in exactly the same way when the documents are sorted by relevance to a query. The retrieval results in Figure 2 show that the uniform method is ineffective for the TREC collection. By comparison to both the single collection and optimal fusion, the strategy is dreadful. A more promising approach to fusion, which we call merge-sort, is to assume that the relevance scores attributed to documents are comparable across collections and that the scores are accessible. Using merge-sort, the document scores are used to merge the documents from individual collections into a totally ordered list. The problem with this approach is that the assumption of comparable relevance scores is incorrect. At the very least, each collection has a different distribution of words. In our use of SMART, this difference affects the inverse document frequency weighting of the query vectors. Hence, even if exactly the same retrieval system is used on every collection, the results of the merge-sort fusion will likely be inferior to those of the single collection. If different retrieval systems are used, then the scores attributed to identical documents in different collections may be totally incompatible. (Callan, Lu and Croft (1995) propose methods for making the scores more compatible.) Figure 2 shows merge-sort fusion by comparison to the optimal fusion and the single collection.
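Under the (incorrect) comparability assumption, merge-sort fusion itself is mechanically simple. A minimal sketch, assuming each server returns a list of (score, document id) pairs sorted by decreasing score:

```python
import heapq
from itertools import islice

def merge_sort_fusion(ranked_lists, n_total):
    """Merge per-collection result lists, each sorted by decreasing
    score, into one totally ordered list and keep the n_total best.
    Valid only if scores are comparable across collections."""
    merged = heapq.merge(*ranked_lists, key=lambda pair: pair[0],
                         reverse=True)
    return [doc for _, doc in islice(merged, n_total)]
```

For example, merging [(0.9, "a1"), (0.5, "a2")] and [(0.8, "b1"), (0.1, "b2")] with n_total = 3 returns ["a1", "b1", "a2"]; the flaw discussed above is that a score of 0.9 from one collection need not mean more than 0.8 from another.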
The merge-sort result is only slightly inferior to that of the single collection. This is as expected: the retrieval system is constant, thereby avoiding the major problems of this approach. Still, analysis of individual queries reveals flaws that result from different word distributions within collections. For instance, consider the TREC query:

... management effectiveness of the United Nations (UN) and related organizations ...

When retrieving over individual collections, the highest ranked document is in the DOE collection. The abstract of that document begins:

The relations of the IAEA with the United Nations, with subsidiary UN organs, with other UN specialized agencies, ...

This document receives a high score because both "United Nations" and "organization" are rare terms in the DOE collection. Thus irrelevant documents in the DOE collection that contain these terms have higher similarity scores than do the relevant documents in the AP collection.

2.6 RELATED WORK

We are aware of only two related papers. The first, by Moffat and Zobel (1994), views the goal of collection fusion as reducing the load on a central document server. Hence, they implement collection fusion as a coarse-to-fine search. The central document server has complete statistics about each document collection but uses that information only as a coarse filter, attempting to recognize and eliminate irrelevant collections. This contrasts sharply with our assumption that no machine has access to complete statistics about all documents. The other paper on collection fusion, by Callan, Lu and Croft (1995), relaxes the assumptions of Moffat and Zobel. Whereas Moffat and Zobel assume that the central machine has access to complete statistics on each document, Callan, Lu and Croft assume that the querying machine has complete statistics about each document collection. The document collection statistics are then assembled into a structure that ranks the relevance of each collection to a given query. Because of the difference in assumptions about the available information, neither of these techniques is directly comparable to those described in this paper. Hence, neither of these systems is included in the empirical investigation reported in Section 4.

3 LEARNING COLLECTION FUSION STRATEGIES

This section describes three approaches to learning collection fusion strategies. We developed the first two strategies specifically for this problem.
The third strategy uses neural networks. We chose neural networks as an example of a generic learning algorithm because they have proven to be at least as effective as other learning algorithms (Atlas et al., 1990; Shavlik et al., 1991) and they are naturally able to learn functions with real-valued outputs. We also evaluated genetic algorithms on this problem (Gupta & Voorhees, 1994). We do not report these results because genetic algorithms did not match the ability of neural networks. The strategy based upon neural networks is essentially a straw man against which we will evaluate our systems. We look upon neural networks as a straw man because we believe that generic learning algorithms will not be able to adequately learn a solution to this task. Rather, we believe that solving this task requires building specialized algorithms. Analysis of the three algorithms in Section 3.4 and empirical tests in Section 4 support this claim.

3.1 MODELING RELEVANT DOCUMENT DISTRIBUTIONS

We call the first algorithm modeling relevant document distributions (MRDD). The algorithm models the relevant document distribution of a query, q, by averaging the relevant document distributions of the k most similar training queries (i.e., the k nearest neighbors of q). MRDD uses query vectors as described in Section 2.3 and cosine distance to determine the similarity between vectors. For all three of the techniques described in this section, the query vectors are built in a space containing only the training queries. As depicted in Figure 3, the first step of MRDD is to find the k most similar training queries. Then, for each collection, determine the average distribution of relevant documents across the k queries. For example, suppose the rows in the top table of Figure 1 represent the relevant document distribution for a single collection for three different queries.
Then the average relevant document distribution is: [ ] Once the average relevant document distribution is computed for the current query for each collection, the distributions and the total number of documents to be retrieved are passed to a maximization procedure that determines the optimal fusion given the estimates of F_Q^I for each collection (step 2 in Figure 3). This procedure finds the cut-off level, λ_i, for each collection that maximizes the number of relevant documents retrieved. (The current maximization procedure simply does an exhaustive search.) The computed cut-off levels are the number of documents selected from each collection. Finally (in step 3 of Figure 3), a total ordering is imposed on the model distributions by a random process biased in favor of collections with many remaining documents. This total ordering is used simply to determine the order in which documents are presented to the user. Tests revealed that k = 8 performed as well as or better than other values (Voorhees et al., 1995). Hence, all tests reported in the next section use this setting.
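The two learned components of MRDD, averaging the distributions of the k nearest training queries and exhaustively searching for the best cut-off levels, can be sketched as follows. The sparse-dictionary query vectors and the prefix-count encoding of F_Q^I (counts[s] = relevant documents among the top s retrieved) are representational assumptions of this sketch:

```python
import math
from itertools import product

def cosine(u, v):
    # u, v: sparse term -> weight dictionaries.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def average_distribution(query_vec, training, k=8, depth=10):
    """Average the relevant-document prefix counts of the k training
    queries most similar to query_vec.  `training` is a list of
    (query_vector, prefix_counts) pairs for one collection."""
    nearest = sorted(training, key=lambda t: -cosine(query_vec, t[0]))[:k]
    return [sum(t[1][s] for t in nearest) / len(nearest)
            for s in range(depth + 1)]

def maximize(avg_dists, n_total):
    """Exhaustively choose cut-off levels summing to n_total that
    maximize the estimated number of relevant documents retrieved."""
    names = list(avg_dists)
    best = (-1.0, None)
    for cuts in product(range(n_total + 1), repeat=len(names)):
        if sum(cuts) == n_total:
            score = sum(avg_dists[c][cut] for c, cut in zip(names, cuts))
            best = max(best, (score, cuts))
    return dict(zip(names, best[1]))
```

The exhaustive search mirrors the paper's note that the maximization "simply does an exhaustive search"; its cost grows quickly with the number of collections, which is tolerable for five.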

[Figure 3: The relevant document distribution (MRDD) fusion strategy. Step 1: build data structures from the M training queries: a collection of M query vectors, and the distribution of relevant documents for each of the M queries in each of the C collections. Step 2: predict the number of documents to retrieve from each collection for a new query: compute the average distribution of the k nearest neighbors in each collection, then use the maximization procedure on N and the average distributions to select the collection cut-off levels λ_1, λ_2, λ_3. Step 3: form the ranked result for the query as the union of the top λ_c documents from each collection.]

3.2 QUERY CLUSTERING

The second fusion strategy, query clustering (QC), does not form an explicit model of a collection's relevant document distribution. Instead, QC learns a measure of the quality of a search for a particular topic area on the collection. The number of documents retrieved from a collection for a new query is proportional to the value of the quality measure for that query. As in MRDD, QC uses query vectors to represent the queries. Topic areas are represented as centroids of query clusters. For each collection, the set of training queries is clustered using the number of documents retrieved in common between two queries as a similarity measure. The assumption is that if two queries retrieve many documents in common, they are about the same topic. The centroid of a query cluster is created by averaging the vectors of the queries contained within the cluster. This centroid is the system's representation of the topic covered by that query cluster. The training phase also assigns to each cluster a weight that reflects how effective queries in the cluster are on that collection. The weight is computed as the average number of relevant documents retrieved by queries in the cluster, where a document counts as retrieved if it was among the first L (a parameter of the method) documents.
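The cluster weight just described admits a direct sketch. The encoding of each query's retrieval result as a list of the (1-based) ranks of its relevant documents is an assumption of this illustration:

```python
def cluster_weight(cluster_queries, relevant_ranks, L=100):
    """Weight of a query cluster on one collection: the average number
    of relevant documents its queries retrieved within the top L.
    `relevant_ranks[q]` lists the 1-based ranks of the relevant
    documents retrieved for query q on this collection."""
    hits = sum(sum(1 for r in relevant_ranks[q] if r <= L)
               for q in cluster_queries)
    return hits / len(cluster_queries)
```

For instance, a two-query cluster whose queries found relevant documents at ranks {1, 5, 200} and {3} has weight (2 + 1) / 2 = 1.5 at L = 100: the rank-200 hit falls outside the cut-off.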
After training, queries are processed as shown in Figure 4. The cluster whose centroid vector is most similar to the query vector is selected for the query, and the associated weight is returned. The set of weights returned by all the collections is used to apportion the retrieved set: when N documents are to be returned and w_i is the weight returned by collection i, then (w_i / Σ_{j=1}^{C} w_j) × N (rounded appropriately) documents are retrieved from collection i. For example, assume the total number of documents to be retrieved is 100 and there are five collections. If the weights returned by the collections are 4, 3, 3, 0, and 2, then 33 documents would be retrieved from collection 1, 25 each from collections 2 and 3, none from collection 4, and 17 from collection 5. The weight of a cluster for a single collection in isolation is not meaningful; it is the relative difference in the weights returned by the set of collections over which the fusion is to be performed that is important. The only parameter affecting the performance of query clustering is L. Testing indicated that L = 100 is as effective as or better than other values (Voorhees et al., 1995).
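The apportionment rule can be sketched as below. The paper only says the shares are "rounded appropriately", so the drift-correction step here is an illustrative assumption about how to keep the rounded cut-offs summing to N:

```python
def apportion(weights, n_total):
    """Split n_total documents across collections in proportion to the
    quality weights returned for the query."""
    total = sum(weights)
    if total == 0:
        return [0] * len(weights)
    raw = [w * n_total / total for w in weights]
    cuts = [round(x) for x in raw]
    # Repair rounding drift so the cut-offs still sum to n_total:
    # add to the most under-allocated entries (or subtract from the
    # most over-allocated ones) until the totals agree.
    drift = n_total - sum(cuts)
    order = sorted(range(len(raw)), key=lambda i: cuts[i] - raw[i])
    for i in range(abs(drift)):
        if drift > 0:
            cuts[order[i]] += 1
        else:
            cuts[order[-1 - i]] -= 1
    return cuts
```

Run on the example above (weights 4, 3, 3, 0, 2 and N = 100), this reproduces the allocation 33, 25, 25, 0, 17.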

[Figure 4: The query clustering fusion strategy. Step 1: build data structures from the M training queries: a set of query clusters per collection, with a centroid vector and quality weight per cluster. Step 2: form the ranked retrieved set for a new query: find the closest centroid in each collection, return the corresponding weight, and apportion the retrieved set according to the weights.]

3.3 NEURAL NETWORKS

To apply neural networks (NN) to this task, we trained feedforward networks (Rumelhart et al., 1986) to learn the optimal fusion for a given number of documents to be retrieved. We did this by creating networks with one output unit per collection. (An alternative would be to train one network per collection, i.e., create five networks each with a single output unit. We did not pursue this design because Dietterich et al.'s (1990) results on text-to-speech mapping indicate that it is not beneficial.) The target vector was the optimal fusion normalized by the number of documents to be retrieved. As input to the network, we used the same term-frequency weighted query vectors as MRDD and QC. This created networks with approximately 1600 input units and 5 output units. After testing many network configurations, we settled on one with a single, completely connected layer of 10 hidden units. The optimal fusion is not easily learned; it is neither a smooth nor a consistent function. Consider again Figure 1, in particular the second table. At several places, this table notes that one or two documents may be taken from any collection. More of a problem is the transition from 7 to 9 documents to be retrieved. At this point, the optimal fusion changes radically. Such changes are both difficult to learn and almost certainly idiosyncratic to the particular query. This combination is unfortunate. The network will likely spend a large fraction of its resources learning these idiosyncrasies when they probably should be given little consideration.
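A minimal numpy sketch of this architecture, a single hidden layer of 10 units feeding one output per collection, is shown below. The sigmoid activation, the random weight initialization, and the final renormalization are assumptions beyond what is stated above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions from the text: ~1600 query-term inputs, 10 hidden units,
# one output unit per collection (5 for the TREC subcollections).
W1 = rng.normal(scale=0.1, size=(1600, 10))
W2 = rng.normal(scale=0.1, size=(10, 5))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(query_vec):
    """Map a term-frequency query vector to a predicted fusion: the
    fraction of the retrieval budget to spend on each collection
    (the target during training is the optimal fusion divided by N)."""
    hidden = sigmoid(query_vec @ W1)
    out = sigmoid(hidden @ W2)
    return out / out.sum()  # renormalize so the fractions sum to 1
```

Multiplying the output fractions by N and rounding gives per-collection cut-off levels; training the weights by backpropagation against the normalized optimal fusion is omitted from this sketch.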
Some of the problems with the lack of smoothness in the output function are ameliorated by evaluating the networks in terms of their precision rather than their ability to reproduce the optimal fusion. Frequently, errors that are major according to the optimal fusion are small or nonexistent in terms of precision. For instance, the difference between the two optimal fusions for eight documents retrieved in Figure 1 is quite large. However, in terms of precision, there is no difference.

3.4 ANALYSIS OF THE LEARNING SYSTEMS

In addition to the surface differences in the three algorithms described in this section, there are two important differences in the generality of the learned strategies. First, networks trained to reproduce the optimal fusion are dependent upon the number of documents to be retrieved. Applying networks trained at one retrieval level to other retrieval levels degrades performance; e.g., networks trained to retrieve 50 documents were worse at retrieving 200 documents than networks trained to retrieve 200 documents. Hence, every retrieval level requires optimizing an independent network. (Actually, it is possible to use networks optimized at a specific retrieval level over a range of levels. However, we can only empirically determine the effective range of a network.) By contrast, both MRDD and QC are almost completely independent of the number of documents to be retrieved. Only at the final step of the performance task is this information considered. Second, NN is critically dependent upon the particular collections in use. When collections are deleted, the optimal fusion must be recalculated for every old query and the networks must be reoptimized. When collections are added, either every old query must be run against the new collection, or the old queries must be thrown out. Neither of these alternatives is palatable. Conversely, adding or deleting collections is trivial for MRDD and QC because they treat each collection independently. Only at the last step of the performance task is the presence or absence of specific collections a consideration. The algorithms can also be compared in terms of their memory requirements. MRDD requires space for every query as well as the relevant document distributions for each collection for each query. The memory requirements of QC are much smaller. It requires storing only a small number of query vector centroids for each collection. NN may require the smallest amount of memory, as each hidden unit is equivalent to one query vector. On the other hand, in the worst case each retrieval level requires a different network. Hence, the memory requirements of NN may exceed those of QC. Finally, these algorithms can be compared in terms of their speed during performance. QC is arguably the fastest performer. Its calculations are extremely simple and few are necessary. NN is only slightly, if at all, slower.
It requires fewer calculations than QC, but they are more complex. The slowest of the systems is MRDD. For each collection, the current query must be compared to every previous query, and then a centroid must be computed over the relevant document distributions. To this point, time has not been a significant factor in our experiments. We expect that time will never be a significant factor because the retrieval of documents dominates any consideration of time.

4 EXPERIMENTS

The experiments reported here test the ability of each of the three fusion methods described in the previous section to generalize from various amounts of training data. To ensure that the results were not affected by exogenous factors, we used the following protocol. First, 25 queries were selected at random without replacement and set aside as a test set. From the remaining 74 queries, 10 queries were selected as the smallest training set. An additional 40 queries were then selected and added to the 10-query set. Finally, the remaining 24 queries were added to make our largest training set. Each of the systems was trained and tested using the sets so created. This procedure was repeated 15 times. The results in Figure 5 are averages over these 15 trials. (In addition, each neural network trial was repeated 11 times to average out the effects of network initialization and presentation order of the training examples.) Optimal fusion, single collection, and uniform were also tested on the 15 test sets to allow these targets to be included in our statistical comparisons. In the rest of this section, "significant" means that the difference was statistically significant with 99.9% confidence according to a one-tailed paired-sample t-test. There are seven notable trends in Figure 5: 1. The optimal fusion is always significantly superior to both the single collection and the three learning systems.
As stated previously, this indicates that it may be possible to develop a collection fusion system that, while less than optimal, performs at least as well as a single collection. 2. The precision of the single collection is always significantly superior to that of the three learning systems. While the first result suggests that a fusion system may be able to exceed the performance of the single collection, none of the methods we studied does so. Still, as the following table indicates, given 74 training queries, MRDD is always within 10% of the single collection, except at 10 documents retrieved.

[Table: precision by number of documents to retrieve for Single, MRDD, QC, and NN; values lost in extraction.]

3. The graph for 10 documents retrieved differs from the others. Only query clustering made significant improvements as the size of the training set increased; MRDD actually got slightly worse. The explanation for this is that, as previously noted, at small numbers of documents to be retrieved there is a bias towards AP in our set of queries. Each of the learning systems was able to learn this from the smallest training set, so there was little to learn from additional examples. 4. As the number of training examples increases, there is significant improvement in the precision of each learning system at all retrieval levels other than 10 documents.

5. Given equal numbers of training examples, MRDD was significantly superior to both QC and NN at both 50 and 100 documents retrieved. At 200 documents to be retrieved, MRDD was superior to both QC and NN, but not significantly. 6. Other than at 10 documents to be retrieved, there is not a significant difference between QC and NN. 7. Every system is always significantly better than uniform sampling (data not shown).

[Figure 5: Three collection fusion strategies tested at four retrieval levels (10, 50, 100, and 200 documents), plotted against the number of training examples; the panels compare optimal fusion, single collection, MRDD, QC, and NN. The scale on the Y axes of the graphs is not consistent.]

5 DISCUSSION AND CONCLUSIONS

In this paper we described the problem of collection fusion and three algorithms for that problem. Two of the algorithms we originated; the third is based upon neural networks. We showed that both of our algorithms are able to learn from past queries. The neural network we trained for the task was also able to learn from past queries. However, it never achieved the precision of our techniques. On the other hand, none of the techniques neared the precision of the single collection. The best, MRDD, is about 10% worse at the largest training set size we studied. We believe that additional training data will lessen this deficit. But training data in this domain is expensive. Hence, it is appropriate to evaluate collection fusion systems trained on small amounts of data. One of the issues that we did not investigate in this study is the availability of training data. All three techniques assume that relevant document distributions are available for every query for every collection. In practice, this is unlikely to be the case. Users of an IR system are unlikely to spend time annotating lists of documents for their relevance to a particular query.
Rather, users will simply select, from a set of potentially relevant documents, those few that they wish to see. The documents selected by the user are likely relevant to the query. However, there is no guarantee that documents not selected are irrelevant. These factors suggest that the data available to a learning system working on collection fusion is likely to be much less robust than the data with which we have been working. Practical systems for collection fusion must be able to cope with these problems. In addition, the user will likely be operating in a mode that adds or deletes collections. This is a problem for all three systems because new collections will not have data on the same set of queries as older collections. As noted in Section 3.4, this is an especially large problem for NN. The optimal fusion is dependent upon the particular collections to be fused. Deleting a collection merely requires reworking the optimal fusion. Adding collections causes more problems. New collections lack relevant document distributions for old queries. Hence, old queries must either be rerun to provide this information or thrown away. Neither solution is acceptable. In summary, we have described two algorithms for the collection fusion problem. The two algorithms differ in their memory requirements, speed on the performance task, and

their ability to solve the problem. If memory and time are not issues, then MRDD provides better solutions. On the other hand, QC sacrifices some solution quality for a considerable reduction in memory and increase in speed. Both systems, because they take full advantage of the data available on this problem, exceed the abilities of a standard (i.e., weak) machine learning algorithm. Moreover, their precision is close to that of a single collection. While neither of our systems is robust to the issues that will face a real collection fusion system, they are a good first step.

References

Atlas, L., Cole, R., Muthusamy, Y., Lippman, A., Connor, J., Park, D., El-Sharkawi, M., & Marks, R. J. (1990). A performance comparison of trained multi-layer perceptrons and trained classification trees. Proceedings of the IEEE, 78.

Buckley, C. (1985). Implementation of the SMART Information Retrieval System. (Technical Report), Ithaca, New York: Computer Science Department, Cornell University.

Callan, J. P., Lu, Z., & Croft, W. B. (1995). Searching distributed collections with inference networks. Proceedings of the 1995 ACM SIGIR Conference on Research and Development in Information Retrieval.

Dietterich, T. G., Hild, H., & Bakiri, G. (1990). A comparative study of ID3 and backpropagation for English text-to-speech mapping. Proceedings of the Seventh International Conference on Machine Learning. Austin, TX.

Gupta, N. & Voorhees, E. M. (1994). Genetic Algorithms for Learning Models of Information Servers. (Technical Report SCR-94-TR-501): Siemens Corporate Research, Inc.

Harman, D. K. (1993). The first Text REtrieval Conference (TREC-1), Rockville, MD, U.S.A., 4-6 November. Information Processing and Management, 29(4).

Moffat, A. & Zobel, J. (1994). Information retrieval for large document collections. Proceedings of the Third Text REtrieval Conference (TREC-3). In press.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations. Cambridge, MA: MIT Press.

Salton, G. & Buckley, C. (1988). Term weighting approaches in automatic text retrieval. Information Processing and Management, 24.

Salton, G. & McGill, M. J. (1983). Introduction to Modern Information Retrieval. New York: McGraw-Hill.

Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11).

Shavlik, J. W., Mooney, R. J., & Towell, G. G. (1991). Symbolic and neural net learning algorithms: An empirical comparison. Machine Learning, 6.

Voorhees, E. M., Gupta, N. K., & Johnson-Laird, B. (1994). The collection fusion problem. Proceedings of the Third Text REtrieval Conference (TREC-3). In press.

Voorhees, E. M., Gupta, N. K., & Johnson-Laird, B. (1995). Learning collection fusion strategies. Proceedings of the 1995 ACM SIGIR Conference on Research and Development in Information Retrieval.

Appendix

The following table gives the identification numbers of the 99 TREC queries used in this paper.

[Table omitted in transcription: query identification numbers grouped by preferred collection (AP, WSJ, ZIFF).]

Persons wishing to obtain the data used in this paper have two alternatives. The first is to obtain the entire TREC document collection. Contact Donna Harman at NIST for details. The second option is to use exactly the data used in this paper. The data, and a description, are available via anonymous ftp from scr.siemens.com in the directory pub/learning/packages.


More information

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es

More information

Applying the KISS Principle for the CLEF- IP 2010 Prior Art Candidate Patent Search Task

Applying the KISS Principle for the CLEF- IP 2010 Prior Art Candidate Patent Search Task Applying the KISS Principle for the CLEF- IP 2010 Prior Art Candidate Patent Search Task Walid Magdy, Gareth J.F. Jones Centre for Next Generation Localisation School of Computing Dublin City University,

More information

Report on the TREC-4 Experiment: Combining Probabilistic and Vector-Space Schemes

Report on the TREC-4 Experiment: Combining Probabilistic and Vector-Space Schemes Report on the TREC-4 Experiment: Combining Probabilistic and Vector-Space Schemes Jacques Savoy, Melchior Ndarugendamwo, Dana Vrajitoru Faculté de droit et des sciences économiques Université de Neuchâtel

More information

highest cosine coecient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate

highest cosine coecient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate Searching Information Servers Based on Customized Proles Technical Report USC-CS-96-636 Shih-Hao Li and Peter B. Danzig Computer Science Department University of Southern California Los Angeles, California

More information

Inferring Variable Labels Considering Co-occurrence of Variable Labels in Data Jackets

Inferring Variable Labels Considering Co-occurrence of Variable Labels in Data Jackets 2016 IEEE 16th International Conference on Data Mining Workshops Inferring Variable Labels Considering Co-occurrence of Variable Labels in Data Jackets Teruaki Hayashi Department of Systems Innovation

More information

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl

More information

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl

More information

Search Engines Chapter 8 Evaluating Search Engines Felix Naumann

Search Engines Chapter 8 Evaluating Search Engines Felix Naumann Search Engines Chapter 8 Evaluating Search Engines 9.7.2009 Felix Naumann Evaluation 2 Evaluation is key to building effective and efficient search engines. Drives advancement of search engines When intuition

More information

Federated Search. Jaime Arguello INLS 509: Information Retrieval November 21, Thursday, November 17, 16

Federated Search. Jaime Arguello INLS 509: Information Retrieval November 21, Thursday, November 17, 16 Federated Search Jaime Arguello INLS 509: Information Retrieval jarguell@email.unc.edu November 21, 2016 Up to this point... Classic information retrieval search from a single centralized index all ueries

More information

Reducing Redundancy with Anchor Text and Spam Priors

Reducing Redundancy with Anchor Text and Spam Priors Reducing Redundancy with Anchor Text and Spam Priors Marijn Koolen 1 Jaap Kamps 1,2 1 Archives and Information Studies, Faculty of Humanities, University of Amsterdam 2 ISLA, Informatics Institute, University

More information

Automatic Domain Partitioning for Multi-Domain Learning

Automatic Domain Partitioning for Multi-Domain Learning Automatic Domain Partitioning for Multi-Domain Learning Di Wang diwang@cs.cmu.edu Chenyan Xiong cx@cs.cmu.edu William Yang Wang ww@cmu.edu Abstract Multi-Domain learning (MDL) assumes that the domain labels

More information

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation I.Ceema *1, M.Kavitha *2, G.Renukadevi *3, G.sripriya *4, S. RajeshKumar #5 * Assistant Professor, Bon Secourse College

More information

Empirical risk minimization (ERM) A first model of learning. The excess risk. Getting a uniform guarantee

Empirical risk minimization (ERM) A first model of learning. The excess risk. Getting a uniform guarantee A first model of learning Let s restrict our attention to binary classification our labels belong to (or ) Empirical risk minimization (ERM) Recall the definitions of risk/empirical risk We observe the

More information