Learning Collection Fusion Strategies for Information Retrieval


Appears in Proceedings of the Twelfth Annual Machine Learning Conference, Lake Tahoe, July 1995

Learning Collection Fusion Strategies for Information Retrieval

Geoffrey Towell, Ellen M. Voorhees, Narendra K. Gupta, Ben Johnson-Laird
Siemens Corporate Research, 755 College Road East, Princeton, NJ

Abstract

In this paper we describe an Information Retrieval problem called collection fusion. The collection fusion problem is to maximize the number of relevant natural language documents retrieved given: a natural language query, multiple collections of documents, and a fixed total number of documents to retrieve. We describe two algorithms that use past queries to learn collection fusion strategies. Tests of these algorithms on a corpus of 742,000 documents indicate that they can learn good fusion strategies. Moreover, the strategies learned by our methods are consistently superior to those learned by a standard learning algorithm.

1 INTRODUCTION

The goal of an information retrieval (IR) system is to find those documents that are relevant to a natural language query from within a collection of natural language documents. Many IR systems use statistical methods to assign a match score to each document in the collection rather than attempting to understand the text (Salton & McGill, 1983). Scores depend upon factors including, but not limited to: the words in the query, the words in the document, the word frequencies in the document collection, and the particular scoring function. Changes to any of these items affect scores. Typically, IR systems have assumed that the documents to be selected among are stored in a single, monolithic collection. Hence, the documents retrieved for a given query are those with the best scores. However, there are cases in which it is natural for documents to be stored in several collections, more than one of which may contain documents of interest. For example, you might have collections of sports and business documents.
A query about highly paid people might find relevant documents in both collections. When documents are stored in multiple collections, the IR problem is more complex because scores may not be compatible across collections. At the very least, the word frequencies in the document collections will vary. Hence, for a given query, identical documents in two collections can be expected to receive different scores. As a result, scores assigned to documents cannot be used in a straightforward manner to create a total ordering. Without a total ordering (or an alternative such as the optimal fusion; see Section 2.4), it is difficult to determine the number of documents to retrieve from each collection. The problem of determining the number of documents to be retrieved from each collection in an environment with multiple document collections is collection fusion. Our goal is to develop collection fusion strategies and to investigate how effectively these strategies can learn from past queries. In general, the precision of our fusion methods is within 10% of the precision obtained when the collections are treated as a single collection. (Precision is a common measure in IR; it is the number of relevant documents retrieved divided by the total number of documents retrieved.) In addition, our fusion methods are more effective than one based upon a standard learning approach. Details of our experiments are reported in Section 4. The next section provides the underpinnings of this report. It contains a formal definition of the collection fusion problem and describes two simple, but ineffective, fusion strategies. This section also briefly describes the IR system used in all of our work, the targets to which our strategies will aspire, and the data set upon which our tests are executed. The succeeding section describes our two fusion methods (Voorhees et al., 1994; Voorhees et al., 1995) and a method that uses neural networks.
The final section analyses the strengths and weaknesses of these methods and discusses areas in need of further work.

2 UNDERPINNINGS

2.1 FORMAL DEFINITION OF COLLECTION FUSION

Consider a set of document collections, each of which is accessed through its own information server. For a given query Q, each collection I has a certain number of relevant documents. We assume that when the server for I is presented with Q, it returns a list of documents sorted by decreasing similarity to the query. We denote the distribution of relevant documents in the retrieved set (i.e., the ranks at which the relevant documents occur) by F_Q^I(S), a function that maps the number of retrieved documents, S, into a vector whose length is equal to the number of relevant documents in the S documents retrieved (see, for example, the first table in Figure 1). The collection fusion problem can now be formally stated as follows:

Given: 1) Q, a query; 2) I_1, I_2, ..., I_C, information servers; 3) N, the total number of documents to be retrieved;

Find: values of λ_1, λ_2, ..., λ_C such that Σ_{i=1}^{C} λ_i = N and Σ_{i=1}^{C} |F_Q^{I_i}(λ_i)| is maximized.

That is, the goal is to maximize the total number of relevant documents retrieved. In practice, F_Q^I is not known and so must be approximated.

2.2 THE TREC DOCUMENT COLLECTION

All of the retrieval runs use the approximately 742,000 documents and the 200 queries and relevance assessments in the TREC collection (Harman, 1993). This collection is a standard in the IR community. It consists of five subcollections that are of different sizes, cover diverse topics, have different retrieval characteristics, and come from different sources. The sources of the documents are: the A.P. newswire (referred to hereafter as AP), U.S. Department of Energy publications (DOE), the Federal Register (FR), the Wall Street Journal (WSJ), and Ziff-Davis Publishing Computer Selects (ZIFF). We could not use all 200 queries in the TREC collection because a large majority of the queries preferentially select AP.
As a result, each of the methods described in Section 3 learns after seeing very few queries that the best strategy is to retrieve documents from AP. Hence, the set of all 200 queries does not adequately test the efficacy of the algorithms. So, we selected a 99-member subset of the queries such that the three largest sub-collections, AP, WSJ, and ZIFF, are each preferentially selected by one-third of the queries when a large number of documents are retrieved. Specifically, for each of these three collections we selected the 33 queries that maximized the ratio

(number of relevant documents in the collection) / (number of relevant documents in AP, WSJ, and ZIFF combined).

This measure is independent of the underlying IR system; it is the only result in this paper for which this is the case. For all but 3 queries selected by this metric, the number of relevant documents in the preferred collection is greater than the total number of relevant documents in the other collections. (The DOE and FR collections are preferentially selected by very few queries, so it is not possible to select a set of queries balanced across all five document sources. Hence, we did not consider them in our selection procedure. Their inclusion would have changed 7 of the 99 queries in our test collection.) Although we chose queries for their preferred collection, when a small number of documents are retrieved (using the IR system described next) these queries have a bias towards AP. For example, when 10 documents are retrieved from each collection, AP has the largest number of relevant documents 51 times. Instructions for obtaining the TREC collection, and a list of the 99 queries we used, are included as an appendix.

2.3 UNDERLYING IR SYSTEM

Throughout the remainder of this paper, natural language is not used directly. Rather, we use the vector space model of IR (Salton et al., 1975). In this model, documents and queries are translated into vectors using the following four-step process: 1.
remove suffixes (e.g., remove "ing" from "sleeping"); 2. remove stop words (e.g., "the", "of"); 3. fold the remaining words into a vector in which each position corresponds to a stemmed word; 4. fill the vector with word frequencies. We use the SMART system (Buckley, 1985) as the IR system underlying all of our tests. That is, SMART is used to create the document and query vectors and to score the relationship between documents and queries. In SMART, we used: document vectors as described; query vectors as described, except that the frequency of each word was divided by the number of documents in which it occurred in the given collection; and the cosine similarity measure. This combination has proven effective (Salton & Buckley, 1988).
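As a concrete illustration, the four-step vectorization and the cosine measure can be sketched as follows. The tiny stop-word list and the crude suffix stripper are illustrative stand-ins for this sketch, not the actual SMART implementation (SMART uses a full stemmer and stop list):

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "of", "a", "is", "to", "and"}  # tiny illustrative list

def stem(word):
    # Crude suffix stripping standing in for a real stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def to_vector(text):
    """Steps 1-4: strip suffixes, drop stop words, fold stems into a
    sparse vector of raw term frequencies."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(w) for w in words if w not in STOP_WORDS)

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0
```

For example, `to_vector("sleeping dogs")` and `to_vector("the dog sleeps")` both reduce to the stems {sleep, dog}, so their cosine similarity is 1.0.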

For three collections, labeled A, B, and C, the positions of the relevant documents in the 10 most highly ranked documents for a given query, i.e., F_Q^I(10), are given in the first table of Figure 1. The second table gives the number of documents to retrieve from each collection, and the number of relevant documents thereby retrieved, for each total number of documents to retrieve.

[Figure 1: Optimal Fusion. The figure's two tables, the relevant document distributions for collections A, B, and C and the resulting per-collection retrieval counts, were lost in extraction.]

2.4 TARGETS

We compare our learned fusion strategies to two targets: single collection precision and the optimal fusion. Single collection precision approximates the ultimate ability of the retrieval system. It assumes that the user can merge all subcollections into a single, monolithic collection. Optimal fusion is a retrospective technique that gives an upper bound on the effectiveness of any fusion strategy. It uses the actual distribution of the relevant documents (F_Q^I above) to maximize the number of relevant documents retrieved. Figure 1 illustrates this procedure. Figure 2 shows the effectiveness of these two targets as the number of documents to be retrieved is varied. For both targets, as the number of documents to be retrieved increases, the precision decreases. This is as expected: there are a limited number of relevant documents. More interesting is that the precision of the optimal fusion is consistently superior to that of the single collection. This result is encouraging: fusion systems that are less than optimal may still be as effective as a single collection.

2.5 SIMPLE COLLECTION FUSION STRATEGIES

Perhaps the simplest fusion strategy, which we will call uniform, is based on the assumption that every collection has the same distribution of relevant documents for every possible query (e.g., if this were true, then in Figure 1 the rows in the first table would always be identical).
Under this assumption, retrieving an equal number of documents from each collection will, on average, maximize the number of relevant documents retrieved. In practice, this assumption is not valid.

[Figure 2: The precision of two simple fusion strategies (uniform and merge-sort) and two targets (optimal fusion and single collection) as the number of documents to be retrieved varies.]

Different collections have different specialties and thus do not have equal numbers of relevant documents. Given different numbers of relevant documents, it is difficult to imagine that relevant documents will be distributed in exactly the same way when the documents are sorted by relevance to a query. The retrieval results in Figure 2 show that the uniform method is ineffective for the TREC collection. By comparison to both the single collection and optimal fusion, the strategy is dreadful. A more promising approach to fusion, which we call merge-sort, is to assume that the relevance scores attributed to documents are comparable across collections and that the scores are accessible. Using merge-sort, the document scores are used to merge the documents from individual collections into a totally ordered list. The problem with this approach is that the assumption of comparable relevance scores is incorrect. At the very least, each collection has a different distribution of words. In our use of SMART, this difference affects the inverse document frequency weighting of the query vectors. Hence, even if exactly the same retrieval system is used on every collection, the results of the merge-sort fusion will likely be inferior to those of the single collection. If different retrieval systems are used, then the scores attributed to identical documents in different collections may be totally incompatible. (Callan, Lu and Croft (1995) propose methods for making the scores more compatible.) Figure 2 shows merge-sort fusion by comparison to the optimal fusion and the single collection.
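Under the (incorrect) comparability assumption, merge-sort fusion itself is mechanically simple. A minimal sketch, assuming each server returns a list of (score, document id) pairs sorted by decreasing score:

```python
import heapq
from itertools import islice

def merge_sort_fusion(ranked_lists, n_total):
    """Merge per-collection result lists, each sorted by decreasing
    score, into one totally ordered list and keep the n_total best.
    Valid only if scores are comparable across collections."""
    merged = heapq.merge(*ranked_lists, key=lambda pair: pair[0],
                         reverse=True)
    return [doc for _, doc in islice(merged, n_total)]
```

For example, merging [(0.9, "a1"), (0.5, "a2")] and [(0.8, "b1"), (0.1, "b2")] with n_total = 3 returns ["a1", "b1", "a2"]; the flaw discussed above is that a score of 0.9 from one collection need not mean more than 0.8 from another.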
The merge-sort result is only slightly inferior to that of the single collection. This is as expected: the retrieval system is constant, thereby avoiding the major problems of this approach. Still, analysis of individual queries reveals flaws that result from different word distributions within collections. For instance, consider the TREC query:

... management effectiveness of the United Nations (UN) and related organizations ...

When retrieving over individual collections, the highest ranked document is in the DOE collection. The abstract of that document begins:

The relations of the IAEA with the United Nations, with subsidiary UN organs, with other UN specialized agencies, ...

This document receives a high score because both "United Nations" and "organization" are rare terms in the DOE collection. Thus irrelevant documents in the DOE collection that contain these terms have higher similarity scores than do the relevant documents in the AP collection.

2.6 RELATED WORK

We are aware of only two related papers. The first, by Moffat and Zobel (1994), views the goal of collection fusion as reducing the load on a central document server. Hence, they implement collection fusion as a coarse-to-fine search. The central document server has complete statistics about each document collection but uses that information only as a coarse filter, attempting to recognize and eliminate irrelevant collections. This contrasts sharply with our assumption that no machine has access to complete statistics about all documents. The other paper on collection fusion, by Callan, Lu and Croft (1995), relaxes the assumptions of Moffat and Zobel. Whereas Moffat and Zobel assume that the central machine has access to complete statistics on each document, Callan, Lu and Croft assume that the querying machine has complete statistics about each document collection. The document collection statistics are then assembled into a structure that ranks the relevance of each collection to a given query. Because of the difference in assumptions about the available information, neither of these techniques is directly comparable to those described in this paper. Hence, neither of these systems is included in the empirical investigation reported in Section 4.

3 LEARNING COLLECTION FUSION STRATEGIES

This section describes three approaches to learning collection fusion strategies. We developed the first two strategies specifically for this problem.
The third strategy uses neural networks. We chose neural networks as an example of a generic learning algorithm because they have proven to be at least as effective as other learning algorithms (Atlas et al., 1990; Shavlik et al., 1991) and they are naturally able to learn functions with real-valued outputs. We also evaluated genetic algorithms on this problem (Gupta & Voorhees, 1994). We do not report these results because genetic algorithms did not match the ability of neural networks. The strategy based upon neural networks is essentially a straw man against which we will evaluate our systems. We look upon neural networks as a straw man because we believe that generic learning algorithms will not be able to adequately learn a solution to this task. Rather, we believe that solving this task requires building specialized algorithms. Analysis of the three algorithms in Section 3.4 and empirical tests in Section 4 support this claim.

3.1 MODELING RELEVANT DOCUMENT DISTRIBUTIONS

We call the first algorithm modeling relevant document distributions (MRDD). The algorithm models the relevant document distribution of a query, q, by averaging the relevant document distributions of the k most similar training queries (i.e., the k nearest neighbors of q). MRDD uses query vectors as described in Section 2.3 and cosine distance to determine the similarity between vectors. For all three of the techniques described in this section, the query vectors are built in a space containing only the training queries. As depicted in Figure 3, the first step of MRDD is to find the k most similar training queries. Then, for each collection, determine the average distribution of relevant documents across the k queries. For example, suppose the rows in the top table of Figure 1 represent the relevant document distribution for a single collection for three different queries.
Then the average relevant document distribution is: [ ] Once the average relevant document distribution is computed for the current query for each collection, the distributions and the total number of documents to be retrieved are passed to a maximization procedure that determines the optimal fusion given the estimates of F_Q^I for each collection (step 2 in Figure 3). This procedure finds the cut-off level, λ_i, for each collection that maximizes the number of relevant documents retrieved. (The current maximization procedure simply does an exhaustive search.) The computed cut-off levels are the number of documents selected from each collection. Finally (in step 3 of Figure 3), a total ordering is imposed on the model distributions by a random process biased in favor of collections with many remaining documents. This total ordering is used simply to determine the order in which documents are presented to the user. Tests revealed that k = 8 performed as well as or better than other values (Voorhees et al., 1995). Hence, all tests reported in the next section use this setting.
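The two learned components of MRDD, averaging the distributions of the k nearest training queries and exhaustively searching for the best cut-off levels, can be sketched as follows. The sparse-dictionary query vectors and the prefix-count encoding of F_Q^I (counts[s] = relevant documents among the top s retrieved) are representational assumptions of this sketch:

```python
import math
from itertools import product

def cosine(u, v):
    # u, v: sparse term -> weight dictionaries.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def average_distribution(query_vec, training, k=8, depth=10):
    """Average the relevant-document prefix counts of the k training
    queries most similar to query_vec.  `training` is a list of
    (query_vector, prefix_counts) pairs for one collection."""
    nearest = sorted(training, key=lambda t: -cosine(query_vec, t[0]))[:k]
    return [sum(t[1][s] for t in nearest) / len(nearest)
            for s in range(depth + 1)]

def maximize(avg_dists, n_total):
    """Exhaustively choose cut-off levels summing to n_total that
    maximize the estimated number of relevant documents retrieved."""
    names = list(avg_dists)
    best = (-1.0, None)
    for cuts in product(range(n_total + 1), repeat=len(names)):
        if sum(cuts) == n_total:
            score = sum(avg_dists[c][cut] for c, cut in zip(names, cuts))
            best = max(best, (score, cuts))
    return dict(zip(names, best[1]))
```

The exhaustive search mirrors the paper's note that the maximization "simply does an exhaustive search"; its cost grows quickly with the number of collections, which is tolerable for five.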

[Figure 3: The relevant document distribution (MRDD) fusion strategy. Step 1: build data structures from the M training queries: a collection of M query vectors, and the distribution of relevant documents for each of the M queries in each of the C collections. Step 2: predict the number of documents to retrieve from each collection for a new query: compute the average distribution of the k nearest neighbors in each collection, then use the maximization procedure on N and the average distributions to select the collection cut-off levels λ_1, λ_2, λ_3. Step 3: form the ranked result for the query as the union of the top λ_c documents from each collection.]

3.2 QUERY CLUSTERING

The second fusion strategy, query clustering (QC), does not form an explicit model of a collection's relevant document distribution. Instead, QC learns a measure of the quality of a search for a particular topic area on the collection. The number of documents retrieved from a collection for a new query is proportional to the value of the quality measure for that query. As in MRDD, QC uses query vectors to represent the queries. Topic areas are represented as centroids of query clusters. For each collection, the set of training queries is clustered using the number of documents retrieved in common between two queries as a similarity measure. The assumption is that if two queries retrieve many documents in common, they are about the same topic. The centroid of a query cluster is created by averaging the vectors of the queries contained within the cluster. This centroid is the system's representation of the topic covered by that query cluster. The training phase also assigns to each cluster a weight that reflects how effective queries in the cluster are on that collection. The weight is computed as the average number of relevant documents retrieved by queries in the cluster, where a document counts as retrieved if it was among the first L (a parameter of the method) documents.
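The cluster weight just described admits a direct sketch. The encoding of each query's retrieval result as a list of the (1-based) ranks of its relevant documents is an assumption of this illustration:

```python
def cluster_weight(cluster_queries, relevant_ranks, L=100):
    """Weight of a query cluster on one collection: the average number
    of relevant documents its queries retrieved within the top L.
    `relevant_ranks[q]` lists the 1-based ranks of the relevant
    documents retrieved for query q on this collection."""
    hits = sum(sum(1 for r in relevant_ranks[q] if r <= L)
               for q in cluster_queries)
    return hits / len(cluster_queries)
```

For instance, a two-query cluster whose queries found relevant documents at ranks {1, 5, 200} and {3} has weight (2 + 1) / 2 = 1.5 at L = 100: the rank-200 hit falls outside the cut-off.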
After training, queries are processed as shown in Figure 4. The cluster whose centroid vector is most similar to the query vector is selected for the query, and the associated weight is returned. The set of weights returned by all the collections is used to apportion the retrieved set: when N documents are to be returned and w_i is the weight returned by collection i, then (w_i / Σ_{j=1}^{C} w_j) × N (rounded appropriately) documents are retrieved from collection i. For example, assume the total number of documents to be retrieved is 100 and there are five collections. If the weights returned by the collections are 4, 3, 3, 0, and 2, then 33 documents would be retrieved from collection 1, 25 each from collections 2 and 3, none from collection 4, and 17 from collection 5. The weight of a cluster for a single collection in isolation is not meaningful; it is the relative difference in the weights returned by the set of collections over which the fusion is to be performed that is important. The only parameter affecting the performance of query clustering is L. Testing indicated that L = 100 is as effective as or better than other values (Voorhees et al., 1995).
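The apportionment rule can be sketched as below. The paper only says the shares are "rounded appropriately", so the drift-correction step here is an illustrative assumption about how to keep the rounded cut-offs summing to N:

```python
def apportion(weights, n_total):
    """Split n_total documents across collections in proportion to the
    quality weights returned for the query."""
    total = sum(weights)
    if total == 0:
        return [0] * len(weights)
    raw = [w * n_total / total for w in weights]
    cuts = [round(x) for x in raw]
    # Repair rounding drift so the cut-offs still sum to n_total:
    # add to the most under-allocated entries (or subtract from the
    # most over-allocated ones) until the totals agree.
    drift = n_total - sum(cuts)
    order = sorted(range(len(raw)), key=lambda i: cuts[i] - raw[i])
    for i in range(abs(drift)):
        if drift > 0:
            cuts[order[i]] += 1
        else:
            cuts[order[-1 - i]] -= 1
    return cuts
```

Run on the example above (weights 4, 3, 3, 0, 2 and N = 100), this reproduces the allocation 33, 25, 25, 0, 17.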

[Figure 4: The query clustering fusion strategy. Step 1: build data structures from the M training queries: a set of query clusters per collection, with a centroid vector and quality weight per cluster. Step 2: form the ranked retrieved set for a new query: find the closest centroid in each collection, return the corresponding weight, and apportion the retrieved set according to the weights.]

3.3 NEURAL NETWORKS

To apply neural networks (NN) to this task, we trained feedforward networks (Rumelhart et al., 1986) to learn the optimal fusion for a given number of documents to be retrieved. We did this by creating networks with one output unit per collection. (An alternative would be to train one network per collection, i.e., create five networks each with a single output unit. We did not pursue this design because Dietterich et al.'s (1990) results on text-to-speech mapping indicate that it is not beneficial.) The target vector was the optimal fusion normalized by the number of documents to be retrieved. As input to the network, we used the same term-frequency weighted query vectors as MRDD and QC. This created networks with approximately 1600 input units and 5 output units. After testing many network configurations, we settled on one with a single, completely connected layer of 10 hidden units. The optimal fusion is not easily learned; it is neither a smooth nor a consistent function. Consider again Figure 1, in particular the second table. At several places, this table notes that one or two documents may be taken from any collection. More of a problem is the transition from 7 to 9 documents to be retrieved. At this point, the optimal fusion changes radically. Such changes are both difficult to learn and almost certainly idiosyncratic to the particular query. This combination is unfortunate. The network will likely spend a large fraction of its resources learning these idiosyncrasies when they probably should be given little consideration.
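A minimal numpy sketch of this architecture, a single hidden layer of 10 units feeding one output per collection, is shown below. The sigmoid activation, the random weight initialization, and the final renormalization are assumptions beyond what is stated above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions from the text: ~1600 query-term inputs, 10 hidden units,
# one output unit per collection (5 for the TREC subcollections).
W1 = rng.normal(scale=0.1, size=(1600, 10))
W2 = rng.normal(scale=0.1, size=(10, 5))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(query_vec):
    """Map a term-frequency query vector to a predicted fusion: the
    fraction of the retrieval budget to spend on each collection
    (the target during training is the optimal fusion divided by N)."""
    hidden = sigmoid(query_vec @ W1)
    out = sigmoid(hidden @ W2)
    return out / out.sum()  # renormalize so the fractions sum to 1
```

Multiplying the output fractions by N and rounding gives per-collection cut-off levels; training the weights by backpropagation against the normalized optimal fusion is omitted from this sketch.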
Some of the problems with the lack of smoothness in the output function are ameliorated by evaluating the networks in terms of their precision rather than their ability to reproduce the optimal fusion. Frequently, errors that are major according to the optimal fusion are small or nonexistent in terms of precision. For instance, the difference between the two optimal fusions for eight documents retrieved in Figure 1 is quite large. However, in terms of precision, there is no difference.

3.4 ANALYSIS OF THE LEARNING SYSTEMS

In addition to the surface differences in the three algorithms described in this section, there are two important differences in the generality of the learned strategies. First, networks trained to reproduce the optimal fusion are dependent upon the number of documents to be retrieved. Applying networks trained at one retrieval level to other retrieval levels degrades performance; e.g., networks trained to retrieve 50 documents were worse at retrieving 200 documents than networks trained to retrieve 200 documents. Hence, every retrieval level requires optimizing an independent network. (Actually, it is possible to use networks optimized at a specific retrieval level over a range of levels. However, we can only empirically determine the effective range of a network.) By contrast, both MRDD and QC are almost completely independent of the number of documents to be retrieved. Only at the final step of the performance task is this information considered. Second, NN is critically dependent upon the particular collections in use. When collections are deleted, the optimal fusion must be recalculated for every old query and the networks must be reoptimized. When collections are added, either every old query must be run against the new collection, or the old queries must be thrown out. Neither of these alternatives is palatable. Conversely, adding or deleting collections is trivial for MRDD and QC because they treat each collection independently. Only at the last step of the performance task is the presence or absence of specific collections a consideration. The algorithms can also be compared in terms of their memory requirements. MRDD requires space for every query as well as the relevant document distributions for each collection for each query. The memory requirements of QC are much smaller. It requires storing only a small number of query vector centroids for each collection. NN may require the smallest amount of memory, as each hidden unit is equivalent to one query vector. On the other hand, in the worst case each retrieval level requires a different network. Hence, the memory requirements of NN may exceed those of QC. Finally, these algorithms can be compared in terms of their speed during performance. QC is arguably the fastest performer. Its calculations are extremely simple and few are necessary. NN is only slightly, if at all, slower.
It requires fewer calculations than QC, but they are more complex. The slowest of the systems is MRDD. For each collection, the current query must be compared to every previous query, and then a centroid must be computed over the relevant document distributions. To this point, time has not been a significant factor in our experiments. We expect that time will never be a significant factor because the retrieval of documents dominates any consideration of time.

4 EXPERIMENTS

The experiments reported here test the ability of each of the three fusion methods described in the previous section to generalize from various amounts of training data. To ensure that the results were not affected by exogenous factors, we used the following protocol. First, 25 queries were selected at random without replacement and set aside as a test set. From the remaining 74 queries, 10 queries were selected as the smallest training set. An additional 40 queries were then selected and added to the 10-query set. Finally, the remaining 24 queries were added to make our largest training set. Each of the systems was trained and tested using the sets so created. This procedure was repeated 15 times. The results in Figure 5 are averages over these 15 trials. (In addition, each neural network trial was repeated 11 times to average out the effects of network initialization and presentation order of the training examples.) Optimal fusion, single collection, and uniform were also tested on the 15 test sets to allow these targets to be included in our statistical comparisons. In the rest of this section, "significant" means that the difference was statistically significant with 99.9% confidence according to a one-tailed paired-sample t-test. There are seven notable trends in Figure 5: 1. The optimal fusion is always significantly superior to both the single collection and the three learning systems.
As stated previously, this indicates that it may be possible to develop a collection fusion system that, while less than optimal, performs at least as well as a single collection. 2. The precision of the single collection is always significantly superior to that of the three learning systems. While the first result suggests that a fusion system may be able to exceed the performance of the single collection, none of the methods we studied does so. Still, as the following table indicates, given 74 training queries, MRDD is always within 10% of the single collection, except at 10 documents retrieved.

[Table: precision by number of documents to retrieve for Single, MRDD, QC, and NN; values lost in extraction.]

3. The graph for 10 documents retrieved differs from the others. Only query clustering made significant improvements as the size of the training set increased; MRDD actually got slightly worse. The explanation for this is that, as previously noted, at small numbers of documents to be retrieved there is a bias towards AP in our set of queries. Each of the learning systems was able to learn this from the smallest training set, so there was little to learn from additional examples. 4. As the number of training examples increases, there is significant improvement in the precision of each learning system at all retrieval levels other than 10 documents.

5. Given equal numbers of training examples, MRDD was significantly superior to both QC and NN at both 50 and 100 documents retrieved. At 200 documents to be retrieved, MRDD was superior to both QC and NN, but not significantly. 6. Other than at 10 documents to be retrieved, there is not a significant difference between QC and NN. 7. Every system is always significantly better than uniform sampling (data not shown).

[Figure 5: Three collection fusion strategies tested at four retrieval levels (10, 50, 100, and 200 documents), plotted against the number of training examples; the panels compare optimal fusion, single collection, MRDD, QC, and NN. The scale on the Y axes of the graphs is not consistent.]

5 DISCUSSION AND CONCLUSIONS

In this paper we described the problem of collection fusion and three algorithms for that problem. Two of the algorithms we originated; the third is based upon neural networks. We showed that both of our algorithms are able to learn from past queries. The neural network we trained for the task was also able to learn from past queries. However, it never achieved the precision of our techniques. On the other hand, none of the techniques neared the precision of the single collection. The best, MRDD, is about 10% worse at the largest training set size we studied. We believe that additional training data will lessen this deficit. But training data in this domain is expensive. Hence, it is appropriate to evaluate collection fusion systems trained on small amounts of data. One of the issues that we did not investigate in this study is the availability of training data. All three techniques assume that relevant document distributions are available for every query for every collection. In practice, this is unlikely to be the case. Users of an IR system are unlikely to spend time annotating lists of documents for their relevance to a particular query.
Rather, users will simply select, from a set of potentially relevant documents, those few that they wish to see. The documents selected by the user are likely relevant to the query. However, there is no guarantee that documents not selected are irrelevant. These factors suggest that the data available to a learning system working on collection fusion is likely to be much less robust than the data with which we have been working. Practical systems for collection fusion must be able to cope with these problems. In addition, the user will likely be operating in a mode that adds or deletes collections. This is a problem for all three systems because new collections will not have data on the same set of queries as older collections. As noted in Section 3.4, this is an especially large problem for NN. The optimal fusion is dependent upon the particular collections to be fused. Deleting a collection merely requires reworking the optimal fusion. Adding collections causes more problems. New collections lack relevant document distributions for old queries. Hence, old queries must either be rerun to provide this information or thrown away. Neither solution is acceptable. In summary, we have described two algorithms for the collection fusion problem. The two algorithms differ in their memory requirements, speed on the performance task, and

their ability to solve the problem. If memory and time are not issues, then MRDD provides better solutions. On the other hand, QC sacrifices some solution quality for a considerable reduction in memory and increase in speed. Both systems, because they take full advantage of the data available on this problem, exceed the abilities of a standard (i.e., weak) machine learning algorithm. Moreover, their precision is close to that of a single collection. While neither of our systems is robust to the issues that will face a real collection fusion system, they are a good first step.

References

Atlas, L., Cole, R., Muthusamy, Y., Lippman, A., Connor, J., Park, D., El-Sharkawi, M., & Marks, R. J. (1990). A performance comparison of trained multi-layer perceptrons and trained classification trees. Proceedings of the IEEE, 78.

Buckley, C. (1985). Implementation of the SMART Information Retrieval System. (Technical Report), Ithaca, New York: Computer Science Department, Cornell University.

Callan, J. P., Lu, Z., & Croft, W. B. (1995). Searching distributed collections with inference networks. Proceedings of the 1995 ACM SIGIR Conference on Research and Development in Information Retrieval.

Dietterich, T. G., Hild, H., & Bakiri, G. (1990). A comparative study of ID3 and backpropagation for English text-to-speech mapping. Proceedings of the Seventh International Conference on Machine Learning. Austin, TX.

Gupta, N. & Voorhees, E. M. (1994). Genetic Algorithms for Learning Models of Information Servers. (Technical Report SCR-94-TR-501): Siemens Corporate Research, Inc.

Harman, D. K. (1993). The first Text REtrieval Conference (TREC-1), Rockville, MD, U.S.A., 4-6 November. Information Processing and Management, 29(4).

Moffat, A. & Zobel, J. (1994). Information retrieval for large document collections. Proceedings of the Third Text REtrieval Conference (TREC-3). In press.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations. Cambridge, MA: MIT Press.

Salton, G. & Buckley, C. (1988). Term weighting approaches in automatic text retrieval. Information Processing and Management, 24.

Salton, G. & McGill, M. J. (1983). Introduction to Modern Information Retrieval. New York: McGraw-Hill.

Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11).

Shavlik, J. W., Mooney, R. J., & Towell, G. G. (1991). Symbolic and neural net learning algorithms: An empirical comparison. Machine Learning, 6.

Voorhees, E. M., Gupta, N. K., & Johnson-Laird, B. (1994). The collection fusion problem. Proceedings of the Third Text REtrieval Conference (TREC-3). In press.

Voorhees, E. M., Gupta, N. K., & Johnson-Laird, B. (1995). Learning collection fusion strategies. Proceedings of the 1995 ACM SIGIR Conference on Research and Development in Information Retrieval.

Appendix

The following table gives the identification numbers of the 99 TREC queries used in this paper.

[Table omitted in transcription: query identification numbers grouped by preferred collection (AP, WSJ, ZIFF).]

Persons wishing to obtain the data used in this paper have two alternatives. The first is to obtain the entire TREC document collection. Contact Donna Harman at NIST for details. The second option is to use exactly the data used in this paper. The data, and a description, are available via anonymous ftp from scr.siemens.com in the directory pub/learning/packages.


More information

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es

More information

Applying the KISS Principle for the CLEF- IP 2010 Prior Art Candidate Patent Search Task

Applying the KISS Principle for the CLEF- IP 2010 Prior Art Candidate Patent Search Task Applying the KISS Principle for the CLEF- IP 2010 Prior Art Candidate Patent Search Task Walid Magdy, Gareth J.F. Jones Centre for Next Generation Localisation School of Computing Dublin City University,

More information

Report on the TREC-4 Experiment: Combining Probabilistic and Vector-Space Schemes

Report on the TREC-4 Experiment: Combining Probabilistic and Vector-Space Schemes Report on the TREC-4 Experiment: Combining Probabilistic and Vector-Space Schemes Jacques Savoy, Melchior Ndarugendamwo, Dana Vrajitoru Faculté de droit et des sciences économiques Université de Neuchâtel

More information

highest cosine coecient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate

highest cosine coecient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate Searching Information Servers Based on Customized Proles Technical Report USC-CS-96-636 Shih-Hao Li and Peter B. Danzig Computer Science Department University of Southern California Los Angeles, California

More information

Inferring Variable Labels Considering Co-occurrence of Variable Labels in Data Jackets

Inferring Variable Labels Considering Co-occurrence of Variable Labels in Data Jackets 2016 IEEE 16th International Conference on Data Mining Workshops Inferring Variable Labels Considering Co-occurrence of Variable Labels in Data Jackets Teruaki Hayashi Department of Systems Innovation

More information

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl

More information

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl

More information

Search Engines Chapter 8 Evaluating Search Engines Felix Naumann

Search Engines Chapter 8 Evaluating Search Engines Felix Naumann Search Engines Chapter 8 Evaluating Search Engines 9.7.2009 Felix Naumann Evaluation 2 Evaluation is key to building effective and efficient search engines. Drives advancement of search engines When intuition

More information

Federated Search. Jaime Arguello INLS 509: Information Retrieval November 21, Thursday, November 17, 16

Federated Search. Jaime Arguello INLS 509: Information Retrieval November 21, Thursday, November 17, 16 Federated Search Jaime Arguello INLS 509: Information Retrieval jarguell@email.unc.edu November 21, 2016 Up to this point... Classic information retrieval search from a single centralized index all ueries

More information

Reducing Redundancy with Anchor Text and Spam Priors

Reducing Redundancy with Anchor Text and Spam Priors Reducing Redundancy with Anchor Text and Spam Priors Marijn Koolen 1 Jaap Kamps 1,2 1 Archives and Information Studies, Faculty of Humanities, University of Amsterdam 2 ISLA, Informatics Institute, University

More information

Automatic Domain Partitioning for Multi-Domain Learning

Automatic Domain Partitioning for Multi-Domain Learning Automatic Domain Partitioning for Multi-Domain Learning Di Wang diwang@cs.cmu.edu Chenyan Xiong cx@cs.cmu.edu William Yang Wang ww@cmu.edu Abstract Multi-Domain learning (MDL) assumes that the domain labels

More information

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation I.Ceema *1, M.Kavitha *2, G.Renukadevi *3, G.sripriya *4, S. RajeshKumar #5 * Assistant Professor, Bon Secourse College

More information

Empirical risk minimization (ERM) A first model of learning. The excess risk. Getting a uniform guarantee

Empirical risk minimization (ERM) A first model of learning. The excess risk. Getting a uniform guarantee A first model of learning Let s restrict our attention to binary classification our labels belong to (or ) Empirical risk minimization (ERM) Recall the definitions of risk/empirical risk We observe the

More information