An Evaluation of Information Retrieval Accuracy. with Simulated OCR Output. K. Taghva z, and J. Borsack z. University of Massachusetts, Amherst

Size: px
Start display at page:

Download "An Evaluation of Information Retrieval Accuracy. with Simulated OCR Output. K. Taghva z, and J. Borsack z. University of Massachusetts, Amherst"

Transcription

1 An Evaluation of Information Retrieval Accuracy with Simulated OCR Output W.B. Croft y, S.M. Harding y, K. Taghva z, and J. Borsack z y Computer Science Department University of Massachusetts, Amherst z Information Science Research Institute University of Nevada, Las Vegas Abstract Optical Character Recognition (OCR) is a critical part of many text-based applications. Although some commercial systems use the output from OCR devices to index documents without editing, there is very little quantitative data on the impact of OCR errors on the accuracy of a text retrieval system. Because of the diculty of constructing test collections to obtain this data, we have carried out evaluations using simulated OCR output on a variety of databases. The results show that high quality OCR devices have little eect on the accuracy of retrieval, but low quality devices used with databases of short documents can result in signicant degradation. 1 Introduction Text-based information systems have become increasingly important in business, government, and academia. In many applications, the source of the text is not documents from word processors, but instead documents in their original paper form. Although imaging systems provide a simple means of storing these documents and retrieving them through manually assigned keywords, full-text access will in general be much more eective. In order to get from paper documents to full-text retrieval, OCR will be a crucial part of the process. For printed documents, OCR techniques can recognize words with a high level of accuracy. The number of errors, however, is such that a substantial amount of human editing is required to make the text output suitable for archival and display. The cost of this editing is a major factor in most current OCR applications. For applications that focus on automatic indexing and retrieval, it is possible that the raw word accuracy of the OCR output may be suf- cient, and that expensive editing can be avoided. Some text retrieval systems have taken this approach, combining OCR for indexing and imaging for display. From an information retrieval point of view, the main issue is the impact of OCR indexing errors on the accuracy or eectiveness of the system. The accuracy of an IR system is typically measured using precision and recall 1 with a test collection 1 Precision is the percentage of retrieved documents that are relevant, and recall is the percentage of relevant documents that are retrieved, for a

2 consisting of a document database, queries, and relevance judgements for those queries [6]. Despite the fact that there are commercial retrieval systems that use OCR input, the lack of availability of test collections means that there is very little published data about the eect on retrieval accuracy. In a recent study, Taghva, Borsack, Condit and Erva [8] did a comparison of the output of a retrieval system using a document database created using scanning and OCR, and the same database with errors removed by editing. The comparison was done by comparing the overlap of the retrieved documents for a set of test queries. The results showed that the output was very similar, but the study was limited by the small size of the database, the lack of relevance judgements, and the use of a Boolean logic retrieval system. What is really needed is data showing the eect on recall and precision of OCR indexing with a range of databases, and with a retrieval system that produces ranked output. Ranking systems have clear advantages relative to Boolean logic systems in terms of average eectiveness, and can use simple query formulations without Boolean operators. The fact that they are based on partial matching may in fact make them less susceptible to OCR errors. The problem with obtaining this data is that it is extremely expensive to build test collections, and even more expensive to build them for OCR experiments. In this paper, we describe our rst approach to obtaining accuracy data using simulated OCR output for a range of databases. The simulation is done using data about word error rates for a variety of devices tested at the UNLV Information Science Research Institute (ISRI) [5]. Although the simulation is not completely accurate, it is the rst study about OCR and retrieval eectiveness where the results have some basis particular query. on actual OCR data. In the next section, we describe how the simulation was done. The third section gives details on the test collections used, their characteristics, and the experiments that were performed. The results of the experiments are summarized in the fourth section, and the nal section suggests future directions for this work. 2 The OCR Simulation The data that was used for the simulation was a study of character and word error rates for a range of OCR devices and software [5]. The study was done using a sample of 460 document pages from a Department of Energy test database. The word error rates that were reported in this study are not uniformly distributed throughout the document. In fact, error rates are summarized by device, by page type, by word type, and by word length. The word types distinguished were stopwords and non-stopwords, where stopwords are simple function words such as \and", \the", \of". Pages were divided into groups based on the number of OCR errors on them. Some pages, presumably those with high-quality initial images, had virtually no errors on them, whereas others, which either had poor quality originals or poor quality scans, had large numbers of errors. Statistics were reported for the percentage of pages in each group, and for word error rates by word length within each page group. Table 1 shows some of this data. To produce the simulated OCR test collections, we assumed that the statistics reported in this study would apply to all the document types in the collections we used. We also assumed that word length and type (stopword or non-stopword) were the only factors in determining the chance of an OCR error in a particular page group. A renement would be to give higher prob- An Evaluation of Information Retrieval Accuracy with Simulated OCR Output

3 Table 1: Page quality groups dened for simulating OCR error rates on text retrieval performance. Average accuracy by page group for the two OCR systems used as the basis for the simulations are in nal two columns (OCR1 and OCR2). The standard page size used for the simulation runs was 1778 characters/page. Page Quality Number of Number of Accuracy Accuracy Group Pages Characters OCR1 (%) OCR2 (%) , , , , , Total ,946 abilities of error to those words which contain character strings that are commonly confused by OCR devices. Some data about these common confusions is available, but we decided to ignore this factor in our initial experiments. Two other important assumptions were made. The rst is that all OCR errors result in a corrupted word that is discarded and not indexed. In actual OCR data, valid words are sometimes transformed by errors into other valid words, although it is unlikely with longer words and dicult to simulate accurately. More importantly, OCR errors would typically result in words that are similar to spelling errors that would be indexed. Because the retrieval process is driven by matches with un-corrupted query words, there not much dierence between discarding a word and generating a misspelling. There are, however, some eects. The indexes generated in a real OCR environment will contain a large number of these \misspelled" words. When data is presented on index sizes in this study, those words are ignored. In addition, there is some chance that a commonly misrecognized word will aect the term frequencies that are used to normalize probability estimates in the retrieval system. This is more likely to happen with incorrect zoning, which leads us to the next assumption. In this study, we ignore errors caused by incorrect zoning, that is, attempting to do OCR on gures, maps, etc. A recent study has shown that the type of OCR errors generated by incorrect zoning can have an impact on retrieval performance [7]. In other words, the zoning accuracy of an OCR system is an important factor in determining retrieval performance independent of the word accuracy of the system. This happens because zoning errors can generate large numbers of \misspelled" words that aect a retrieval system's probability estimates. The study reported here focuses on the impact of the word accuracy rates of the OCR system. To generate a simulated OCR database, then, an IR test database is indexed using standard techniques such as tokenization, stemming, and stopword removal [6, 1]. During this process, the text of a document is randomly assigned to page groups, and index words are randomly discarded according to the error rates for that page Croft, Harding, Taghva and Borsack

4 group and word length. The result of the OCR simulation is a database in which documents may be indexed by fewer terms than the original database. More specically, two sets of statistics were used, representing the best and the worst OCR performance observed in the UNLV tests. The ve page group classes, representing dierent levels of page quality, were assigned randomly to the text input stream during the indexing process. The page group and the corresponding set of character recognition error rates remained in eect for the duration of a page. The probability of being in any particular page group was determined from the total number of characters for the page group, divided by the total number of characters for all page groups, which was close to 1 in 5, but not exactly so. Page size was a constant and determined from a calculation dividing the total number of characters in the data set by the total number of pages. The character counts for each page group were a part of the UNLV data. A random number generator producing values between 0 and 1 determined page group assignment when a page full of characters had been read. Simulation of OCR word errors was done by a randomly assigned number between 0 and 1 reecting the probability of error for a word of its length and page group. If the number fell in the error range, it was discarded, otherwise, processed as usual. Word positions, which are used in proximity operators, were counted whether discarded or not. The results of this process on four test collections are given in the next section. 3 The Experiments The experiments were done using the IN- QUERY information retrieval system developed at the University of Massachusetts [2]. INQUERY is based on a probabilistic model of retrieval, has a number of advanced features, and has consistently achieved excellent results at the ARPA-sponsored TREC and TIPSTER evaluations (see [4] for an overview of the TREC evaluation). For the purposes of these experiments, the main features of INQUERY are that it does automatic indexing and produces ranked lists of documents in response to a query. These are features that are common to many recent information retrieval systems. All ranking systems use weighting or estimation functions to determine the relative importance of words in the query and document. As the results discussed later show, the form of these estimation functions can be important in an OCR environment. Four test collections were used in these experiments. The collections were selected to represent a range of sources, document sizes, and query sizes. The CACM collection is a small collection of Computer Science Abstracts [3] and has been a standard benchmark for a number of years. NPL is a larger collection of short documents and short queries that has been used in a variety of IR experiments. WEST is a collection of long, full-text, legal information, specically case law. The WSJ collection is the largest number of documents, which are moderate length, full-text articles from the Wall St. Journal. The WSJ queries are also the longest of any collection. The WSJ collection is a subset of the TIPSTER collection described in [4]. In general, we would expect OCR errors to have more impact on the collections of short documents, since long documents would have much more redundant information. This is one of the factors that is tested in the experiments. Table 2 gives statistics for two of the collections (NPL and WEST) showing the number of word tokens assigned to each page quality group and the number of OCR errors generated in each group. Note that An Evaluation of Information Retrieval Accuracy with Simulated OCR Output

5 Table 2: Summary of total words and OCR errors generated for two test collections, NPL and WEST. Numbers represent word tokens encountered for STD (no OCR errors), and worst and best OCR simulations (OCR1 and OCR2) from UNLV tests. These collections contain the shortest and longest average documents respectively. NPL Page Group Total Total Errors Total Errors 1 479,163 96,815 1,254 96, ,237 4,569 95,926 1, ,562 8,737 94,772 2, ,293 16,714 95,508 5, ,269 40,370 96,291 13,993 WEST Page Group Total Total Errors Total Errors 1 39,549,976 7,984,570 92,881 7,979,432 35, ,887, ,866 7,881, , ,847, ,372 7,849, , ,882,217 1,336,320 7,886, , ,948,286 3,190,063 7,953,203 1,147,189 Croft, Harding, Taghva and Borsack

6 Table 3: Summary statistics for the three versions of four collections used to evaluate the eect of OCR errors on retrieval performance. STD refers to the original collection. OCR1, OCR2 are the worst and the best OCR systems, respectively, from UNLV tests. The dictionary term counts represent the number of unique word stems in the version dictionary. All indexed terms are the number of word stems encountered during the indexing of the text excluding stopwords. Collection Collection Document Average Size Cnt Chars/Doc CACM 1,639,440 3, NPL 3,748,316 11, WEST 297,501,776 11,953 24,889 WSJ 279,249,494 98,735 2,828 Collection Dictionary Terms All Indexed Terms CACM 5,998 5,644 5, ,294 96, ,386 NPL 7,689 7,144 7, , , ,258 WEST 155, , ,891 22,817,834 19,353,353 21,830,212 WSJ 197, , ,508 24,454,116 20,797,586 23,448,131 Table 4: Statistics on standard query sets for each of four collections used to evaluate OCR errors on retrieval performance. Collection Total Queries Number of Words/Query Average Unique Min Mean Max Words/Query CACM NPL WEST WSJ An Evaluation of Information Retrieval Accuracy with Simulated OCR Output

7 these numbers do not represent indexed terms. Many of the words were stopwords and thus discarded. Also, many of these word forms were conated to single unique indexed stems, as stemming was used in the indexing runs. Table 3 shows the results of the OCR simulation on the indexing of all four collections. It gives the gures for the original collection and the two OCR simulations. It should be noted again that \misspelled" words generated by OCR errors are not included in these gures. The table shows that the number of unique terms is reduced considerably in the case of OCR1 (consistently about 7%) and much less in the case of OCR2. From this we would certainly expect to see more impact on the retrieval performance of OCR1. Table 4 gives the statistics for the queries associated with these collections. The main feature here is the length of the Wall St Journal queries. Long queries are another form of redundancy that may oset the eect of OCR errors. From this point of view, the NPL collection has the worst combination of characteristics in that it has both short documents and short queries. We should emphasize, however, that the error generation process is only applied to the document texts, not the queries. 4 Summary of Results The following tables show the results of the retrieval experiments using the three versions of each of the four test collections. Table 5 gives the overall results using the average precision over all recall levels. The CACM entry is a result for one query set. Combined results for 100 runs using four query sets showed average -6.4 and -1.1 percent performance degradation in retrieval performance between standard and OCR1 and OCR2 simulations. The results appear to support the view that collections with short documents and short queries will be aected the most by OCR errors. The collection with the biggest degradation in average precision is NPL. This is also the only collection where the better OCR system (OCR2) caused a significant loss in precision compared to the original collection. The CACM collection, which also has many short documents, had the next largest degradation in performance. The WEST collection, which has very large documents, had the lowest degradation for both OCR systems. From these results, it can also be concluded that using the best OCR system for input to a text retrieval system will generally not signicantly aect retrieval performance for databases with long documents. This conclusion is with respect to word errors other than those generated by zoning errors. It is worth mentioning here that the general \rule of thumb" used in IR experiments is that a change in average precision of less than 5% is not signicant, and a change of around 10% is very signicant. For the NPL collection, then, even the best OCR input resulted in a signicant loss in performance. In order to look at these results in more depth, tables 5 through 9 contain standard recall-precision tables, which show the average precision gures at standard recall points. These tables show that the highest losses in accuracy generally occur at higher recall levels (i.e. further down the ranking). This is what would be expected in that documents which contain many query terms will be less aected by the loss of one of those terms, and these are typically the terms at the top end of the ranking. To study the eect of random variation, we did a large number of retrieval runs for the CACM collection. The only factor that varied between these runs was the random eect of the OCR errors. Table 10 shows that although performance degradations are generally consistent, occasional runs can result in performance im- Croft, Harding, Taghva and Borsack

8 Table 5: Retrieval performance for four standard text collections showing eects of two levels of simulated OCR error rates. Values are average precision over 10 standard recall points from 10 to 100 percent. Percentage dierences are given in parentheses. Results for CACM are for one of four query sets. Average performance loss for 100 simulation runs for CACM was -6.4 and -1.1 percent for OCR1 and OCR2 respectively. Collection Average Precision CACM (-6.9%) 34.3 (-1.7%) NPL (-10.1%) 23.5 (-9.1%) WEST (-4.0%) 48.0 (-0.4%) WSJ (-4.5%) 39.3 (-1.5%) Table 6: The standard recall-precision table for the CACM collection for one of 25 runs using query set 2. Recall Precision (93 queries) (- 3.3) 70.4 (+5.3) (- 0.9) 55.8 (+4.4) (- 3.1) 46.9 (-0.3) (- 6.5) 41.0 (+2.5) (-13.3) 34.4 (-0.7) (-14.8) 26.6 (-7.5) (-14.5) 19.3 (-5.3) (-22.0) 15.5 (-2.3) (-29.9) 10.1 (-5.4) (-33.8) 7.3 (-8.8) avg (- 8.6) 32.7 (+0.5) An Evaluation of Information Retrieval Accuracy with Simulated OCR Output

9 Table 7: The standard recall-precision table for the NPL collection. Recall Precision (93 queries) ( -8.1) 55.8 ( -2.9) ( -5.2) 46.0 ( -5.2) (-12.9) 35.2 (-12.8) (-16.1) 29.2 (-12.1) (-12.6) 22.5 (-14.1) (-11.1) 16.5 ( -9.0) (-10.2) 12.2 (-11.1) ( -7.9) 9.5 ( -9.1) (-10.5) 5.2 (-22.7) ( -1.5) 2.8 (-22.1) avg (-10.1) 23.5 ( -9.1) Table 8: The standard recall-precision table for the WEST collection. Recall Precision (34 queries) (-1.4) 77.9 (-0.3) (-1.6) 73.8 (+0.0) (-2.3) 71.8 (-0.2) (-5.0) 61.6 (-0.6) (-5.3) 54.9 (+0.0) (-4.2) 44.7 (-1.3) (-5.7) 37.2 (-0.4) (-4.0) 29.1 (-2.0) (-8.9) 17.9 (+0.1) (-21.7) 10.7 (+0.4) avg (-4.0) 48.0 (-0.4) Croft, Harding, Taghva and Borsack

10 Table 9: The standard recall-precision table for the WSJ collection. Recall Precision (50 queries) (-0.7) 67.5 (-1.0) (+0.1) 60.4 (+0.2) (-0.9) 53.3 (-0.6) (-2.3) 47.4 (-1.6) (-4.7) 42.2 (+0.4) (-7.1) 37.3 (-1.3) (-8.4) 32.2 (-1.9) (-13.9) 26.2 (-4.4) (-14.9) 18.9 (-5.0) (-17.5) 7.6 (-12.6) avg (-4.5) 39.3 (-1.5) provements, even with OCR1. Signicant changes between runs, as occurs sometimes in OCR1, are more likely to happen with small collections where the recall and precision for a particular query can be significantly aected by changes to just a few documents. There are two ways in which discarding terms at random can improve retrieval performance. One is that the documents that were penalized by the OCR errors were documents that contained query terms but were not relevant. Making those documents hard, or even impossible, to retrieve results in better performance. Another signicant, and more common, effect is that OCR errors can change the frequencies used to calculate the relative importance of words. The most important of these is the maximum frequency for any word in a particular document. This frequency is used for normalization and changes to this frequency can have signicant results, particularly in collections with small document sizes. We are currently developing new estimation techniques for word importance that are much less sensitive to the types of errors introduced by OCR. 5 Future Work The simulations described above could be made more accurate by taking into account which characters are commonly confused by OCR devices. By using knowledge of what types of characters are generated in error, we could also attempt to simulate the generation of valid index terms and the generation of misspelled words. The most dicult aspect of the simulation to improve would be to generate errors arising from poor zoning. In most real applications of OCR to archiving and retrieving information, zoning will be an important issue. The value of the experiments given here is to give some quantitative data that is reasonably accurate. This data shows that even though the output of the best OCR devices can be adequate for automatic indexing and retrieval of databases of longer full-text documents, for collections of very short documents, OCR errors can have a signicant impact on retrieval performance. The most important types of errors will not be random, but rather when a relevant document is made unretrievable by poor quality scanning or word recognition. Regard- An Evaluation of Information Retrieval Accuracy with Simulated OCR Output

11 Table 10: Average precision results at 10 standard recall levels for each of 25 repeated indexing runs using CACM query set 2. Numbers in parentheses represents percent dierence with standard collection. Run CACM query set (-8.6) 32.7 (+0.5) (-9.2) 32.6 (-0.1) (-11.7) 32.7 (+0.4) (+1.4) 32.5 (-0.2) (-8.0) 32.8 (+0.7) (-4.2) 31.6 (-2.9) (-7.4) 31.2 (-4.2) (-5.9) 32.5 (-0.2) (-3.8) 32.2 (-1.0) (-8.8) 32.4 (-0.6) (-7.3) 31.8 (-2.4) (-8.1) 32.5 (-0.3) (-6.7) 31.6 (-3.1) (-7.7) 32.9 (+0.9) (+0.2) 32.0 (-1.7) (-8.3) 31.7 (-2.6) (-10.6) 32.3 (-1.0) (-8.6) 33.3 (+2.3) (-8.8) 32.7 (+0.4) (-7.0) 32.4 (-0.4) (-8.4) 31.6 (-3.1) (-7.9) 32.5 (-0.2) (-6.9) 33.1 (+1.6) (-4.9) 31.7 (-2.9) (-3.2) 32.5 (-0.3) Croft, Harding, Taghva and Borsack

12 less of the source of the errors, the results suggest the need for more research on automatic correction schemes, even when the OCR output is not needed for display. As mentioned above, the other area of research that is being pursued is to develop estimation functions for the retrieval system that are less sensitive to OCR errors. In general, however, it does appear that ranking systems provide a degree of robustness in an environment where documents contain errors. Acknowledgments This research was supported in part by the NSF Center for Intelligent Information Retrieval at the University of Massachusetts, Amherst, and by the Information Science Research Institute, University of Nevada, Las Vegas. ACM SIGIR International Conference on Research and Development in Information Retrieval, pages 36{47, [5] S. Rice, J. Kanai, and T. Nartker. An evaluation of OCR accuracy. In UNLV Information Science Research Institute Annual Report, pages 9{20, [6] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, [7] K. Taghva. Results of applying probabilistic ir to ocr text. pages 1{20, [8] K. Taghva, J. Borsack, A. Condit, and S. Erva. The eects of noisy data on text retrieval. In UNLV Information Science Research Institute Annual Report, pages 71{80, References [1] N. J. Belkin and W.B. Croft. Information ltering and information retrieval: Two sides of the same coin. Communications of the ACM, 35(12):29{38, [2] J. P. Callan, W. B. Croft, and S. M. Harding. The INQUERY retrieval system. In Proceedings of the Third International Conference on Database and Expert Systems Applications, pages 78{ 83, Valencia, Spain, Springer- Verlag. [3] Edward A. Fox. Characterization of two new experimental collections in computer and information science containing textual and bibliographic concepts. Technical Report , Department of Computer Science, Cornell University, Ithaca, NY 14853, September [4] D. Harman. Overview of the rst TREC conference. In Proceedings of the 16 th An Evaluation of Information Retrieval Accuracy with Simulated OCR Output

TREC-3 Ad Hoc Retrieval and Routing. Experiments using the WIN System. Paul Thompson. Howard Turtle. Bokyung Yang. James Flood

TREC-3 Ad Hoc Retrieval and Routing. Experiments using the WIN System. Paul Thompson. Howard Turtle. Bokyung Yang. James Flood TREC-3 Ad Hoc Retrieval and Routing Experiments using the WIN System Paul Thompson Howard Turtle Bokyung Yang James Flood West Publishing Company Eagan, MN 55123 1 Introduction The WIN retrieval engine

More information

Amit Singhal, Chris Buckley, Mandar Mitra. Department of Computer Science, Cornell University, Ithaca, NY 14853

Amit Singhal, Chris Buckley, Mandar Mitra. Department of Computer Science, Cornell University, Ithaca, NY 14853 Pivoted Document Length Normalization Amit Singhal, Chris Buckley, Mandar Mitra Department of Computer Science, Cornell University, Ithaca, NY 8 fsinghal, chrisb, mitrag@cs.cornell.edu Abstract Automatic

More information

Siemens TREC-4 Report: Further Experiments with Database. Merging. Ellen M. Voorhees. Siemens Corporate Research, Inc.

Siemens TREC-4 Report: Further Experiments with Database. Merging. Ellen M. Voorhees. Siemens Corporate Research, Inc. Siemens TREC-4 Report: Further Experiments with Database Merging Ellen M. Voorhees Siemens Corporate Research, Inc. Princeton, NJ ellen@scr.siemens.com Abstract A database merging technique is a strategy

More information

2. PRELIMINARIES MANICURE is specically designed to prepare text collections from printed materials for information retrieval applications. In this ca

2. PRELIMINARIES MANICURE is specically designed to prepare text collections from printed materials for information retrieval applications. In this ca The MANICURE Document Processing System Kazem Taghva, Allen Condit, Julie Borsack, John Kilburg, Changshi Wu, and Je Gilbreth Information Science Research Institute University of Nevada, Las Vegas ABSTRACT

More information

1 Introduction The history of information retrieval may go back as far as According to Maron[7], 1948 signies three important events. The rst is

1 Introduction The history of information retrieval may go back as far as According to Maron[7], 1948 signies three important events. The rst is The MANICURE Document Processing System Kazem Taghva, Allen Condit, Julie Borsack, John Kilburg, Changshi Wu, and Je Gilbreth Technical Report 95-02 Information Science Research Institute University of

More information

The Effects of Noisy Data on Text Retrieval

The Effects of Noisy Data on Text Retrieval The Effects of Noisy Data on Text Retrieval Kazem Taghva, Julie Borsack, Allen Condit, and Srinivas Erva Information Science Research Institute University of Nevada, Las Vegas May 1993 Abstract We report

More information

The Impact of Running Headers and Footers on Proximity Searching

The Impact of Running Headers and Footers on Proximity Searching The Impact of Running Headers and Footers on Proximity Searching Kazem Taghva, Julie Borsack, Tom Nartker, Jeffrey Coombs, Ron Young Information Science Research Institute University of Nevada, Las Vegas

More information

30000 Documents

30000 Documents Document Filtering With Inference Networks Jamie Callan Computer Science Department University of Massachusetts Amherst, MA 13-461, USA callan@cs.umass.edu Abstract Although statistical retrieval models

More information

OCR correction based on document level knowledge

OCR correction based on document level knowledge OCR correction based on document level knowledge T. Nartker, K. Taghva, R. Young, J. Borsack, and A. Condit UNLV/Information Science Research Institute, Box 4021 4505 Maryland Pkwy, Las Vegas, NV USA 89154-4021

More information

highest cosine coecient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate

highest cosine coecient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate Searching Information Servers Based on Customized Proles Technical Report USC-CS-96-636 Shih-Hao Li and Peter B. Danzig Computer Science Department University of Southern California Los Angeles, California

More information

2 Partitioning Methods for an Inverted Index

2 Partitioning Methods for an Inverted Index Impact of the Query Model and System Settings on Performance of Distributed Inverted Indexes Simon Jonassen and Svein Erik Bratsberg Abstract This paper presents an evaluation of three partitioning methods

More information

Using Statistical Properties of Text to Create. Metadata. Computer Science and Electrical Engineering Department

Using Statistical Properties of Text to Create. Metadata. Computer Science and Electrical Engineering Department Using Statistical Properties of Text to Create Metadata Grace Crowder crowder@cs.umbc.edu Charles Nicholas nicholas@cs.umbc.edu Computer Science and Electrical Engineering Department University of Maryland

More information

Information Retrieval Research

Information Retrieval Research ELECTRONIC WORKSHOPS IN COMPUTING Series edited by Professor C.J. van Rijsbergen Jonathan Furner, School of Information and Media Studies, and David Harper, School of Computer and Mathematical Studies,

More information

1 Introduction In recent years, optical character recognition (OCR) vendors have made vast improvements in their commercial products. Some vendors cla

1 Introduction In recent years, optical character recognition (OCR) vendors have made vast improvements in their commercial products. Some vendors cla The Eects of Noisy Data on Text Retrieval Kazem Taghva, Julie Borsack, Allen Condit, and Srinivas Erva Information Science Research Institute University of Nevada, Las Vegas September 1993 Abstract We

More information

James P. Callan and W. Bruce Croft. seven elds describing aspects of the information need: the information need that is related to, but often distinct

James P. Callan and W. Bruce Croft. seven elds describing aspects of the information need: the information need that is related to, but often distinct An Evaluation of Query Processing Strategies Using the TIPSTER Collection James P. Callan and W. Bruce Croft Computer Science Department University of Massachusetts, Amherst, MA 01003, USA callan@cs.umass.edu,

More information

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far

More information

Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching

Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching Wolfgang Tannebaum, Parvaz Madabi and Andreas Rauber Institute of Software Technology and Interactive Systems, Vienna

More information

A Recursive Coalescing Method for Bisecting Graphs

A Recursive Coalescing Method for Bisecting Graphs A Recursive Coalescing Method for Bisecting Graphs The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters. Citation Accessed Citable

More information

CS54701: Information Retrieval

CS54701: Information Retrieval CS54701: Information Retrieval Basic Concepts 19 January 2016 Prof. Chris Clifton 1 Text Representation: Process of Indexing Remove Stopword, Stemming, Phrase Extraction etc Document Parser Extract useful

More information

Routing and Ad-hoc Retrieval with the. Nikolaus Walczuch, Norbert Fuhr, Michael Pollmann, Birgit Sievers. University of Dortmund, Germany.

Routing and Ad-hoc Retrieval with the. Nikolaus Walczuch, Norbert Fuhr, Michael Pollmann, Birgit Sievers. University of Dortmund, Germany. Routing and Ad-hoc Retrieval with the TREC-3 Collection in a Distributed Loosely Federated Environment Nikolaus Walczuch, Norbert Fuhr, Michael Pollmann, Birgit Sievers University of Dortmund, Germany

More information

AT&T at TREC-6. Amit Singhal. AT&T Labs{Research. Abstract

AT&T at TREC-6. Amit Singhal. AT&T Labs{Research. Abstract AT&T at TREC-6 Amit Singhal AT&T Labs{Research singhal@research.att.com Abstract TREC-6 is AT&T's rst independent TREC participation. We are participating in the main tasks (adhoc, routing), the ltering

More information

Performance Measures for Multi-Graded Relevance

Performance Measures for Multi-Graded Relevance Performance Measures for Multi-Graded Relevance Christian Scheel, Andreas Lommatzsch, and Sahin Albayrak Technische Universität Berlin, DAI-Labor, Germany {christian.scheel,andreas.lommatzsch,sahin.albayrak}@dai-labor.de

More information

The Effectiveness of Thesauri-Aided Retrieval

The Effectiveness of Thesauri-Aided Retrieval The Effectiveness of Thesauri-Aided Retrieval Kazem Taghva, Julie Borsack, and Allen Condit Information Science Research Institute University of Nevada, Las Vegas Las Vegas, NV 89154-4021 USA ABSTRACT

More information

TREC-7 Experiments at the University of Maryland Douglas W. Oard Digital Library Research Group College of Library and Information Services University

TREC-7 Experiments at the University of Maryland Douglas W. Oard Digital Library Research Group College of Library and Information Services University TREC-7 Experiments at the University of Maryland Douglas W. Oard Digital Library Research Group College of Library and Information Services University of Maryland, College Park, MD 20742 oard@glue.umd.edu

More information

Document Image Restoration Using Binary Morphological Filters. Jisheng Liang, Robert M. Haralick. Seattle, Washington Ihsin T.

Document Image Restoration Using Binary Morphological Filters. Jisheng Liang, Robert M. Haralick. Seattle, Washington Ihsin T. Document Image Restoration Using Binary Morphological Filters Jisheng Liang, Robert M. Haralick University of Washington, Department of Electrical Engineering Seattle, Washington 98195 Ihsin T. Phillips

More information

Boolean Model. Hongning Wang

Boolean Model. Hongning Wang Boolean Model Hongning Wang CS@UVa Abstraction of search engine architecture Indexed corpus Crawler Ranking procedure Doc Analyzer Doc Representation Query Rep Feedback (Query) Evaluation User Indexer

More information

Implementations of Dijkstra's Algorithm. Based on Multi-Level Buckets. November Abstract

Implementations of Dijkstra's Algorithm. Based on Multi-Level Buckets. November Abstract Implementations of Dijkstra's Algorithm Based on Multi-Level Buckets Andrew V. Goldberg NEC Research Institute 4 Independence Way Princeton, NJ 08540 avg@research.nj.nec.com Craig Silverstein Computer

More information

A taxonomy of race. D. P. Helmbold, C. E. McDowell. September 28, University of California, Santa Cruz. Santa Cruz, CA

A taxonomy of race. D. P. Helmbold, C. E. McDowell. September 28, University of California, Santa Cruz. Santa Cruz, CA A taxonomy of race conditions. D. P. Helmbold, C. E. McDowell UCSC-CRL-94-34 September 28, 1994 Board of Studies in Computer and Information Sciences University of California, Santa Cruz Santa Cruz, CA

More information

Block Addressing Indices for Approximate Text Retrieval. University of Chile. Blanco Encalada Santiago - Chile.

Block Addressing Indices for Approximate Text Retrieval. University of Chile. Blanco Encalada Santiago - Chile. Block Addressing Indices for Approximate Text Retrieval Ricardo Baeza-Yates Gonzalo Navarro Department of Computer Science University of Chile Blanco Encalada 212 - Santiago - Chile frbaeza,gnavarrog@dcc.uchile.cl

More information

A Balanced Term-Weighting Scheme for Effective Document Matching. Technical Report

A Balanced Term-Weighting Scheme for Effective Document Matching. Technical Report A Balanced Term-Weighting Scheme for Effective Document Matching Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building 2 Union Street SE Minneapolis,

More information

Examining the Authority and Ranking Effects as the result list depth used in data fusion is varied

Examining the Authority and Ranking Effects as the result list depth used in data fusion is varied Information Processing and Management 43 (2007) 1044 1058 www.elsevier.com/locate/infoproman Examining the Authority and Ranking Effects as the result list depth used in data fusion is varied Anselm Spoerri

More information

Probabilistic Learning Approaches for Indexing and Retrieval with the. TREC-2 Collection

Probabilistic Learning Approaches for Indexing and Retrieval with the. TREC-2 Collection Probabilistic Learning Approaches for Indexing and Retrieval with the TREC-2 Collection Norbert Fuhr, Ulrich Pfeifer, Christoph Bremkamp, Michael Pollmann University of Dortmund, Germany Chris Buckley

More information

A Model and a Visual Query Language for Structured Text. handle structure. language. These indices have been studied in literature and their

A Model and a Visual Query Language for Structured Text. handle structure. language. These indices have been studied in literature and their A Model and a Visual Query Language for Structured Text Ricardo Baeza-Yates Gonzalo Navarro Depto. de Ciencias de la Computacion, Universidad de Chile frbaeza,gnavarrog@dcc.uchile.cl Jesus Vegas Pablo

More information

Retrieval Evaluation. Hongning Wang

Retrieval Evaluation. Hongning Wang Retrieval Evaluation Hongning Wang CS@UVa What we have learned so far Indexed corpus Crawler Ranking procedure Research attention Doc Analyzer Doc Rep (Index) Query Rep Feedback (Query) Evaluation User

More information

n = 2 n = 1 µ λ n = 0

n = 2 n = 1 µ λ n = 0 A Comparison of Allocation Policies in Wavelength Routing Networks Yuhong Zhu, George N. Rouskas, Harry G. Perros Department of Computer Science, North Carolina State University Abstract We consider wavelength

More information

second_language research_teaching sla vivian_cook language_department idl

second_language research_teaching sla vivian_cook language_department idl Using Implicit Relevance Feedback in a Web Search Assistant Maria Fasli and Udo Kruschwitz Department of Computer Science, University of Essex, Wivenhoe Park, Colchester, CO4 3SQ, United Kingdom fmfasli

More information

Andrew Davenport and Edward Tsang. fdaveat,edwardgessex.ac.uk. mostly soluble problems and regions of overconstrained, mostly insoluble problems as

Andrew Davenport and Edward Tsang. fdaveat,edwardgessex.ac.uk. mostly soluble problems and regions of overconstrained, mostly insoluble problems as An empirical investigation into the exceptionally hard problems Andrew Davenport and Edward Tsang Department of Computer Science, University of Essex, Colchester, Essex CO SQ, United Kingdom. fdaveat,edwardgessex.ac.uk

More information

TEXT CHAPTER 5. W. Bruce Croft BACKGROUND

TEXT CHAPTER 5. W. Bruce Croft BACKGROUND 41 CHAPTER 5 TEXT W. Bruce Croft BACKGROUND Much of the information in digital library or digital information organization applications is in the form of text. Even when the application focuses on multimedia

More information

Retrieval Evaluation

Retrieval Evaluation Retrieval Evaluation - Reference Collections Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, Chapter

More information

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! (301) 219-4649 james.mayfield@jhuapl.edu What is Information Retrieval? Evaluation

More information

Noisy Text Clustering

Noisy Text Clustering R E S E A R C H R E P O R T Noisy Text Clustering David Grangier 1 Alessandro Vinciarelli 2 IDIAP RR 04-31 I D I A P December 2004 1 IDIAP, CP 592, 1920 Martigny, Switzerland, grangier@idiap.ch 2 IDIAP,

More information

Cluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s]

Cluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s] Fast, single-pass K-means algorithms Fredrik Farnstrom Computer Science and Engineering Lund Institute of Technology, Sweden arnstrom@ucsd.edu James Lewis Computer Science and Engineering University of

More information

Development of Search Engines using Lucene: An Experience

Development of Search Engines using Lucene: An Experience Available online at www.sciencedirect.com Procedia Social and Behavioral Sciences 18 (2011) 282 286 Kongres Pengajaran dan Pembelajaran UKM, 2010 Development of Search Engines using Lucene: An Experience

More information

Cheshire II at INEX: Using A Hybrid Logistic Regression and Boolean Model for XML Retrieval

Cheshire II at INEX: Using A Hybrid Logistic Regression and Boolean Model for XML Retrieval Cheshire II at INEX: Using A Hybrid Logistic Regression and Boolean Model for XML Retrieval Ray R. Larson School of Information Management and Systems University of California, Berkeley Berkeley, California,

More information

CMSC 476/676 Information Retrieval Midterm Exam Spring 2014

CMSC 476/676 Information Retrieval Midterm Exam Spring 2014 CMSC 476/676 Information Retrieval Midterm Exam Spring 2014 Name: You may consult your notes and/or your textbook. This is a 75 minute, in class exam. If there is information missing in any of the question

More information

Inference Networks for Document Retrieval. A Dissertation Presented. Howard Robert Turtle. Submitted to the Graduate School of the

Inference Networks for Document Retrieval. A Dissertation Presented. Howard Robert Turtle. Submitted to the Graduate School of the Inference Networks for Document Retrieval A Dissertation Presented by Howard Robert Turtle Submitted to the Graduate School of the University of Massachusetts in partial fulllment of the requirements for

More information

number of documents in global result list

number of documents in global result list Comparison of different Collection Fusion Models in Distributed Information Retrieval Alexander Steidinger Department of Computer Science Free University of Berlin Abstract Distributed information retrieval

More information

Search Engines Chapter 8 Evaluating Search Engines Felix Naumann

Search Engines Chapter 8 Evaluating Search Engines Felix Naumann Search Engines Chapter 8 Evaluating Search Engines 9.7.2009 Felix Naumann Evaluation 2 Evaluation is key to building effective and efficient search engines. Drives advancement of search engines When intuition

More information

Answer Retrieval From Extracted Tables

Answer Retrieval From Extracted Tables Answer Retrieval From Extracted Tables Xing Wei, Bruce Croft, David Pinto Center for Intelligent Information Retrieval University of Massachusetts Amherst 140 Governors Drive Amherst, MA 01002 {xwei,croft,pinto}@cs.umass.edu

More information

A Practical Passage-based Approach for Chinese Document Retrieval

A Practical Passage-based Approach for Chinese Document Retrieval A Practical Passage-based Approach for Chinese Document Retrieval Szu-Yuan Chi 1, Chung-Li Hsiao 1, Lee-Feng Chien 1,2 1. Department of Information Management, National Taiwan University 2. Institute of

More information

characteristic on several topics. Part of the reason is the free publication and multiplication of the Web such that replicated pages are repeated in

characteristic on several topics. Part of the reason is the free publication and multiplication of the Web such that replicated pages are repeated in Hypertext Information Retrieval for Short Queries Chia-Hui Chang and Ching-Chi Hsu Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan 106 E-mail: fchia,

More information

M erg in g C lassifiers for Im p ro v ed In fo rm a tio n R e triev a l

M erg in g C lassifiers for Im p ro v ed In fo rm a tio n R e triev a l M erg in g C lassifiers for Im p ro v ed In fo rm a tio n R e triev a l Anette Hulth, Lars Asker Dept, of Computer and Systems Sciences Stockholm University [hulthi asker]ø dsv.su.s e Jussi Karlgren Swedish

More information

Similarity search in multimedia databases

Similarity search in multimedia databases Similarity search in multimedia databases Performance evaluation for similarity calculations in multimedia databases JO TRYTI AND JOHAN CARLSSON Bachelor s Thesis at CSC Supervisor: Michael Minock Examiner:

More information

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Some Issues in Application of NLP to Intelligent

More information

/$10.00 (c) 1998 IEEE

/$10.00 (c) 1998 IEEE Dual Busy Tone Multiple Access (DBTMA) - Performance Results Zygmunt J. Haas and Jing Deng School of Electrical Engineering Frank Rhodes Hall Cornell University Ithaca, NY 85 E-mail: haas, jing@ee.cornell.edu

More information

From Passages into Elements in XML Retrieval

From Passages into Elements in XML Retrieval From Passages into Elements in XML Retrieval Kelly Y. Itakura David R. Cheriton School of Computer Science, University of Waterloo 200 Univ. Ave. W. Waterloo, ON, Canada yitakura@cs.uwaterloo.ca Charles

More information

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University Information Retrieval System Using Concept Projection Based on PDDP algorithm Minoru SASAKI and Kenji KITA Department of Information Science & Intelligent Systems Faculty of Engineering, Tokushima University

More information

Building Test Collections. Donna Harman National Institute of Standards and Technology

Building Test Collections. Donna Harman National Institute of Standards and Technology Building Test Collections Donna Harman National Institute of Standards and Technology Cranfield 2 (1962-1966) Goal: learn what makes a good indexing descriptor (4 different types tested at 3 levels of

More information

Robust Relevance-Based Language Models

Robust Relevance-Based Language Models Robust Relevance-Based Language Models Xiaoyan Li Department of Computer Science, Mount Holyoke College 50 College Street, South Hadley, MA 01075, USA Email: xli@mtholyoke.edu ABSTRACT We propose a new

More information

ARCHITECTURE AND IMPLEMENTATION OF A NEW USER INTERFACE FOR INTERNET SEARCH ENGINES

ARCHITECTURE AND IMPLEMENTATION OF A NEW USER INTERFACE FOR INTERNET SEARCH ENGINES ARCHITECTURE AND IMPLEMENTATION OF A NEW USER INTERFACE FOR INTERNET SEARCH ENGINES Fidel Cacheda, Alberto Pan, Lucía Ardao, Angel Viña Department of Tecnoloxías da Información e as Comunicacións, Facultad

More information

The Effectiveness of a Dictionary-Based Technique for Indonesian-English Cross-Language Text Retrieval

The Effectiveness of a Dictionary-Based Technique for Indonesian-English Cross-Language Text Retrieval University of Massachusetts Amherst ScholarWorks@UMass Amherst Computer Science Department Faculty Publication Series Computer Science 1997 The Effectiveness of a Dictionary-Based Technique for Indonesian-English

More information

Discover the Depths of your Data with iarchives OWR The benefits of Optical Word Recognition

Discover the Depths of your Data with iarchives OWR The benefits of Optical Word Recognition Discover the Depths of your Data with iarchives OWR The benefits of Optical Word Recognition Through unique technological developments, iarchives is continually exceeding the quality and efficiency standards

More information

APPLICATION OF THE FUZZY MIN-MAX NEURAL NETWORK CLASSIFIER TO PROBLEMS WITH CONTINUOUS AND DISCRETE ATTRIBUTES

APPLICATION OF THE FUZZY MIN-MAX NEURAL NETWORK CLASSIFIER TO PROBLEMS WITH CONTINUOUS AND DISCRETE ATTRIBUTES APPLICATION OF THE FUZZY MIN-MAX NEURAL NETWORK CLASSIFIER TO PROBLEMS WITH CONTINUOUS AND DISCRETE ATTRIBUTES A. Likas, K. Blekas and A. Stafylopatis National Technical University of Athens Department

More information

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS 1 WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS BRUCE CROFT NSF Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts,

More information

A Formal Approach to Score Normalization for Meta-search

A Formal Approach to Score Normalization for Meta-search A Formal Approach to Score Normalization for Meta-search R. Manmatha and H. Sever Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts Amherst, MA 01003

More information

This paper studies methods to enhance cross-language retrieval of domain-specific

This paper studies methods to enhance cross-language retrieval of domain-specific Keith A. Gatlin. Enhancing Cross-Language Retrieval of Comparable Corpora Through Thesaurus-Based Translation and Citation Indexing. A master s paper for the M.S. in I.S. degree. April, 2005. 23 pages.

More information

Improving Synoptic Querying for Source Retrieval

Improving Synoptic Querying for Source Retrieval Improving Synoptic Querying for Source Retrieval Notebook for PAN at CLEF 2015 Šimon Suchomel and Michal Brandejs Faculty of Informatics, Masaryk University {suchomel,brandejs}@fi.muni.cz Abstract Source

More information

Power Functions and Their Use In Selecting Distance Functions for. Document Degradation Model Validation. 600 Mountain Avenue, Room 2C-322

Power Functions and Their Use In Selecting Distance Functions for. Document Degradation Model Validation. 600 Mountain Avenue, Room 2C-322 Power Functions and Their Use In Selecting Distance Functions for Document Degradation Model Validation Tapas Kanungo y ; Robert M. Haralick y and Henry S. Baird z y Department of Electrical Engineering,

More information

MeDoc Information Broker Harnessing the. Information in Literature and Full Text Databases. Dietrich Boles. Markus Dreger y.

MeDoc Information Broker Harnessing the. Information in Literature and Full Text Databases. Dietrich Boles. Markus Dreger y. MeDoc Information Broker Harnessing the Information in Literature and Full Text Databases Dietrich Boles Markus Dreger y Kai Grojohann z June 17, 1996 Introduction. MeDoc is a two-year project sponsored

More information

The only known methods for solving this problem optimally are enumerative in nature, with branch-and-bound being the most ecient. However, such algori

The only known methods for solving this problem optimally are enumerative in nature, with branch-and-bound being the most ecient. However, such algori Use of K-Near Optimal Solutions to Improve Data Association in Multi-frame Processing Aubrey B. Poore a and in Yan a a Department of Mathematics, Colorado State University, Fort Collins, CO, USA ABSTRACT

More information

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl

More information

Progress in Image Analysis and Processing III, pp , World Scientic, Singapore, AUTOMATIC INTERPRETATION OF FLOOR PLANS USING

Progress in Image Analysis and Processing III, pp , World Scientic, Singapore, AUTOMATIC INTERPRETATION OF FLOOR PLANS USING Progress in Image Analysis and Processing III, pp. 233-240, World Scientic, Singapore, 1994. 1 AUTOMATIC INTERPRETATION OF FLOOR PLANS USING SPATIAL INDEXING HANAN SAMET AYA SOFFER Computer Science Department

More information

Information Retrieval

Information Retrieval Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,

More information

Impact of Term Weighting Schemes on Document Clustering A Review

Impact of Term Weighting Schemes on Document Clustering A Review Volume 118 No. 23 2018, 467-475 ISSN: 1314-3395 (on-line version) url: http://acadpubl.eu/hub ijpam.eu Impact of Term Weighting Schemes on Document Clustering A Review G. Hannah Grace and Kalyani Desikan

More information

Applying the KISS Principle for the CLEF- IP 2010 Prior Art Candidate Patent Search Task

Applying the KISS Principle for the CLEF- IP 2010 Prior Art Candidate Patent Search Task Applying the KISS Principle for the CLEF- IP 2010 Prior Art Candidate Patent Search Task Walid Magdy, Gareth J.F. Jones Centre for Next Generation Localisation School of Computing Dublin City University,

More information

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX Towards an Adaptive Distributed Shared Memory (Preliminary Version ) Jai-Hoon Kim Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3 E-mail: fjhkim,vaidyag@cs.tamu.edu

More information

Consistent Logical Checkpointing. Nitin H. Vaidya. Texas A&M University. Phone: Fax:

Consistent Logical Checkpointing. Nitin H. Vaidya. Texas A&M University. Phone: Fax: Consistent Logical Checkpointing Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3112 hone: 409-845-0512 Fax: 409-847-8578 E-mail: vaidya@cs.tamu.edu Technical

More information

size, runs an existing induction algorithm on the rst subset to obtain a rst set of rules, and then processes each of the remaining data subsets at a

size, runs an existing induction algorithm on the rst subset to obtain a rst set of rules, and then processes each of the remaining data subsets at a Multi-Layer Incremental Induction Xindong Wu and William H.W. Lo School of Computer Science and Software Ebgineering Monash University 900 Dandenong Road Melbourne, VIC 3145, Australia Email: xindong@computer.org

More information

Submitted for TAU97 Abstract Many attempts have been made to combine some form of retiming with combinational

Submitted for TAU97 Abstract Many attempts have been made to combine some form of retiming with combinational Experiments in the Iterative Application of Resynthesis and Retiming Soha Hassoun and Carl Ebeling Department of Computer Science and Engineering University ofwashington, Seattle, WA fsoha,ebelingg@cs.washington.edu

More information

[31] T. W. Yan, and H. Garcia-Molina. SIFT { A Tool for Wide-Area Information Dissemination. USENIX

[31] T. W. Yan, and H. Garcia-Molina. SIFT { A Tool for Wide-Area Information Dissemination. USENIX [29] J. Xu, and J. Callan. Eective Retrieval with Distributed Collections. ACM SIGIR Conference, 1998. [30] J. Xu, and B. Croft. Cluster-based Language Models for Distributed Retrieval. ACM SIGIR Conference

More information

Department of. Computer Science. Remapping Subpartitions of. Hyperspace Using Iterative. Genetic Search. Keith Mathias and Darrell Whitley

Department of. Computer Science. Remapping Subpartitions of. Hyperspace Using Iterative. Genetic Search. Keith Mathias and Darrell Whitley Department of Computer Science Remapping Subpartitions of Hyperspace Using Iterative Genetic Search Keith Mathias and Darrell Whitley Technical Report CS-4-11 January 7, 14 Colorado State University Remapping

More information

DCU at FIRE 2013: Cross-Language!ndian News Story Search

DCU at FIRE 2013: Cross-Language!ndian News Story Search DCU at FIRE 2013: Cross-Language!ndian News Story Search Piyush Arora, Jennifer Foster, and Gareth J. F. Jones CNGL Centre for Global Intelligent Content School of Computing, Dublin City University Glasnevin,

More information

Fax: An Alternative to SGML

Fax: An Alternative to SGML Fax: An Alternative to SGML Kenneth W. Church, William A. Gale, Jonathan I. Helfman and David D. Lewis AT&T Bell Laboratories 600 Mountain Ave. Murray Hill, NJ 07974, USA kwc@research.att.com We have argued

More information

A New Measure of the Cluster Hypothesis

A New Measure of the Cluster Hypothesis A New Measure of the Cluster Hypothesis Mark D. Smucker 1 and James Allan 2 1 Department of Management Sciences University of Waterloo 2 Center for Intelligent Information Retrieval Department of Computer

More information

Object classes. recall (%)

Object classes. recall (%) Using Genetic Algorithms to Improve the Accuracy of Object Detection Victor Ciesielski and Mengjie Zhang Department of Computer Science, Royal Melbourne Institute of Technology GPO Box 2476V, Melbourne

More information

Clustering Sequences with Hidden. Markov Models. Padhraic Smyth CA Abstract

Clustering Sequences with Hidden. Markov Models. Padhraic Smyth CA Abstract Clustering Sequences with Hidden Markov Models Padhraic Smyth Information and Computer Science University of California, Irvine CA 92697-3425 smyth@ics.uci.edu Abstract This paper discusses a probabilistic

More information

Graph-based Entity Linking using Shortest Path

Graph-based Entity Linking using Shortest Path Graph-based Entity Linking using Shortest Path Yongsun Shim 1, Sungkwon Yang 1, Hyunwhan Joe 1, Hong-Gee Kim 1 1 Biomedical Knowledge Engineering Laboratory, Seoul National University, Seoul, Korea {yongsun0926,

More information

Effective Tweet Contextualization with Hashtags Performance Prediction and Multi-Document Summarization

Effective Tweet Contextualization with Hashtags Performance Prediction and Multi-Document Summarization Effective Tweet Contextualization with Hashtags Performance Prediction and Multi-Document Summarization Romain Deveaud 1 and Florian Boudin 2 1 LIA - University of Avignon romain.deveaud@univ-avignon.fr

More information

Decomposition. November 20, Abstract. With the electronic storage of documents comes the possibility of

Decomposition. November 20, Abstract. With the electronic storage of documents comes the possibility of Latent Semantic Indexing via a Semi-Discrete Matrix Decomposition Tamara G. Kolda and Dianne P. O'Leary y November, 1996 Abstract With the electronic storage of documents comes the possibility of building

More information

Multimedia Information Systems

Multimedia Information Systems Multimedia Information Systems Samson Cheung EE 639, Fall 2004 Lecture 6: Text Information Retrieval 1 Digital Video Library Meta-Data Meta-Data Similarity Similarity Search Search Analog Video Archive

More information

Improving the Effectiveness of Information Retrieval with Local Context Analysis

Improving the Effectiveness of Information Retrieval with Local Context Analysis Improving the Effectiveness of Information Retrieval with Local Context Analysis JINXI XU BBN Technologies and W. BRUCE CROFT University of Massachusetts Amherst Techniques for automatic query expansion

More information

Scalable Trigram Backoff Language Models

Scalable Trigram Backoff Language Models Scalable Trigram Backoff Language Models Kristie Seymore Ronald Rosenfeld May 1996 CMU-CS-96-139 School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 This material is based upon work

More information

Mercure at trec6 2 IRIT/SIG. Campus Univ. Toulouse III. F Toulouse. fbougha,

Mercure at trec6 2 IRIT/SIG. Campus Univ. Toulouse III. F Toulouse.   fbougha, Mercure at trec6 M. Boughanem 1 2 C. Soule-Dupuy 2 3 1 MSI Universite de Limoges 123, Av. Albert Thomas F-87060 Limoges 2 IRIT/SIG Campus Univ. Toulouse III 118, Route de Narbonne F-31062 Toulouse 3 CERISS

More information

Object Recognition Robust under Translation, Rotation and Scaling in Application of Image Retrieval

Object Recognition Robust under Translation, Rotation and Scaling in Application of Image Retrieval Object Recognition Robust under Translation, Rotation and Scaling in Application of Image Retrieval Sanun Srisuky, Rerkchai Fooprateepsiri? and Sahatsawat Waraklang? yadvanced Machine Intelligence Research

More information

Gen := 0. Create Initial Random Population. Termination Criterion Satisfied? Yes. Evaluate fitness of each individual in population.

Gen := 0. Create Initial Random Population. Termination Criterion Satisfied? Yes. Evaluate fitness of each individual in population. An Experimental Comparison of Genetic Programming and Inductive Logic Programming on Learning Recursive List Functions Lappoon R. Tang Mary Elaine Cali Raymond J. Mooney Department of Computer Sciences

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 3 Retrieval Evaluation Retrieval Performance Evaluation Reference Collections CFC: The Cystic Fibrosis Collection Retrieval Evaluation, Modern Information Retrieval,

More information

University of Maryland. fzzj, basili, Empirical studies (Desurvire, 1994) (Jeries, Miller, USABILITY INSPECTION

University of Maryland. fzzj, basili, Empirical studies (Desurvire, 1994) (Jeries, Miller, USABILITY INSPECTION AN EMPIRICAL STUDY OF PERSPECTIVE-BASED USABILITY INSPECTION Zhijun Zhang, Victor Basili, and Ben Shneiderman Department of Computer Science University of Maryland College Park, MD 20742, USA fzzj, basili,

More information

UW Document Image Databases. Document Analysis Module. Ground-Truthed Information DAFS. Generated Information DAFS. Performance Evaluation

UW Document Image Databases. Document Analysis Module. Ground-Truthed Information DAFS. Generated Information DAFS. Performance Evaluation Performance evaluation of document layout analysis algorithms on the UW data set Jisheng Liang, Ihsin T. Phillips y, and Robert M. Haralick Department of Electrical Engineering, University of Washington,

More information

Preliminary results from an agent-based adaptation of friendship games

Preliminary results from an agent-based adaptation of friendship games Preliminary results from an agent-based adaptation of friendship games David S. Dixon June 29, 2011 This paper presents agent-based model (ABM) equivalents of friendshipbased games and compares the results

More information

images (e.g. intensity edges) using the Hausdor fraction. The technique backgrounds. We report some simple recognition experiments in which

images (e.g. intensity edges) using the Hausdor fraction. The technique backgrounds. We report some simple recognition experiments in which Proc. of the European Conference on Computer Vision, pages 536-545, 1996. Object Recognition Using Subspace Methods Daniel P. Huttenlocher, Ryan H. Lilien and Clark F. Olson Department of Computer Science,

More information