Information Retrieval Term Project : Incremental Indexing Searching Engine


Chi-yau Lin
r @ntu.edu.tw
Department of Computer Science and Information Engineering
National Taiwan University
Taipei, Taiwan

Abstract

Retrieving information that is useful to the user is the goal of an information retrieval (IR) system. In this report we describe how we built our IR system with the Lemur Toolkit from CMU [1]. The Lemur Toolkit is designed to facilitate research in language modeling and information retrieval, where IR is broadly interpreted to include such technologies as ad hoc and distributed retrieval, cross-language IR, summarization, filtering, and classification. The toolkit supports indexing of large-scale text databases, the construction of simple language models for documents, queries, or sub-collections, and the implementation of retrieval systems based on language models as well as a variety of other retrieval models.

1. Introduction

Unfortunately the word information can be very misleading. 'An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request.' In this project, we build an information retrieval system to solve the problem posed on the course web site. Section 2 describes the problem we address. Section 3 presents our approach to that problem. Section 4 reports our experimental results, which show the performance of our IR system. Section 5 describes the problems we met and concludes the report.

2. Problem definition

The problem we want to solve is how to manage a large number of documents and build good data structures for an IR model over the text, so that search is fast. With this model, we obtain an efficient search engine that finds the pattern we want in a short time.
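The kind of data structure meant here is typically an inverted index, which maps each term to the documents that contain it. The sketch below is a minimal illustration, not Lemur's implementation: the class `TinyInvertedIndex`, its methods, and the document ids are all hypothetical. It also shows why incremental indexing is cheap in principle, since adding a second batch only extends the existing posting lists.

```python
from collections import defaultdict


class TinyInvertedIndex:
    """Toy inverted index (hypothetical): term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_ids = set()

    def add_documents(self, docs):
        """Add documents to the index; existing postings are only extended."""
        for doc_id, text in docs.items():
            if doc_id in self.doc_ids:
                raise ValueError(f"document already indexed: {doc_id}")
            self.doc_ids.add(doc_id)
            for term in text.lower().split():
                counts = self.postings[term]
                counts[doc_id] = counts.get(doc_id, 0) + 1

    def search(self, term):
        """Return the sorted ids of documents containing the term."""
        return sorted(self.postings.get(term.lower(), {}))


index = TinyInvertedIndex()
# Full indexing of a first batch (made-up stand-ins for FBIS3 documents).
index.add_documents({"fbis3-a": "trade talks resume", "fbis3-b": "border trade grows"})
# Incremental indexing of a second batch (made-up stand-in for FBIS4).
index.add_documents({"fbis4-a": "trade agreement signed"})
print(index.search("trade"))  # ['fbis3-a', 'fbis3-b', 'fbis4-a']
```

A lookup touches only the posting list of the query term, which is what makes search faster than scanning every document.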
Five query topics {301, 302, 304, 306, 307} are taken from the relevance judgements of the TREC-6 ad hoc track. The documents are divided into two sets: FBIS3 with 236 files and FBIS4 with 256 files. The task is to design an IR model and index FBIS3, then incrementally index FBIS4 on top of FBIS3. After that, the relevance model is evaluated, and the performance of the designed IR model can be read from the output files. The output files contain Full Index Time, Incremental Time, Search Time, Average Precision, Precision at R(30%) and Precision at 10 docs.

3. Approach

We use the Lemur Toolkit for language modeling and information retrieval for our term project, the Incremental Indexing Search Engine. The toolkit was designed and written at Carnegie Mellon University and at the University of Massachusetts. It supports indexing of large-scale text databases, query formulation, and retrieval from the indexed database. All of the features we need to build our Incremental Indexing Search Engine are provided by this toolkit. There are three major steps, shown in Figure 1.

3.1 Parsing Queries

In the first step, we parse the text query file (topics.txt) into a format we can use. We use one of the toolkit's modules, ParseToFile, for query formulation, configured with the parameters below.

    outputfile = query
    stopwords = stopwords.txt
    docformat = web
    stemmer = porter

We give the parser a list of stop words and use the Porter stemmer, a well-known stemming algorithm, to improve query formulation. Figure 2 shows the process of parsing the query.

3.2 Indexing Document Collections

Second, we build our search database by indexing the document collections FBIS3 and FBIS4. The indexing module we use is IncIndexer, which can do incremental indexing. We configure it with the parameters below.

    index = ./project/myindex396
    memory =
    stopwords = stopwords.txt
    docformat = trec
    stemmer = porter
    datafiles = FBIS396

We use 512 MB of memory for Inv(FP)InvFPPushIndex and the same stopwords.txt mentioned above. The documents are in standard TREC format, and the Porter stemmer is again used for stemming. The last parameter, FBIS396, is a file containing the list of data files to index. We use this module to do three things:

(1) index FBIS3, (2) index FBIS3 and then incrementally add FBIS4, and (3) index FBIS3 + FBIS4 together. Figures 3, 4, and 5 below present the indexing process of these three runs.

3.3 Retrieving Documents

The third module we use in this toolkit is RetEval. It provides several retrieval models, and we choose the popular TFIDF model. The module runs retrieval experiments with the parameters below.

    retmodel = 0
    index = ./project/myindex396:ifp
    textquery = query
    resultfile = res396.simpletfidf
    resultcount = 400
    resultformat = 1
    doc.tfmethod = 1
    query.tfmethod = 1

We use log-TF as the TF weighting method for both document terms and query terms. Figures 6 and 7 show the retrieval runs on FBIS3 and on FBIS3+FBIS4.

4. Experiment

The results of our experiment follow. The performance is not as good as we expected, and we suspect that we made a mistake in using the Lemur tools. The full index time, incremental time and search time are reasonable, and even the precision at 10 docs is acceptable, but the average precision and precision at R(30%) are quite low. We do not yet know the exact reason and are still investigating. Table 1 shows the results without stemming, and Table 2 shows the results with stemming. The performance is much better without stemming.

Table 1. Experiment results without stemming

                Full Index   Incremental   Search      Average     Precision    Precision
                Time(sec)    Time          Time(sec)   Precision   at R(30%)    at 10 docs
    FBIS3       [missing]s                 2.8s        [missing]   [missing]    [missing]
    FBIS3+FBIS4 [missing]s   [missing]s    5.5s        [missing]   [missing]    [missing]
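The precision columns in the result tables follow standard ranked-retrieval definitions. As a rough sketch of how two of them can be computed from a ranked result list, the code below implements precision at k and non-interpolated average precision. The function names and the toy ranking are hypothetical, this is not the evaluation code used in the project, and Precision at R(30%) is left out because the report does not define it precisely.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    top = ranked[:k]
    return sum(1 for doc in top if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant document.

    Relevant documents that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Made-up ranking for one query topic; "d9" is relevant but never retrieved.
ranked = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d1", "d2", "d9"}

p_at_5 = precision_at_k(ranked, relevant, 5)   # 2 relevant in top 5 -> 0.4
ap = average_precision(ranked, relevant)       # (1/2 + 2/4) / 3
print(p_at_5, round(ap, 4))
```

Because average precision divides by the total number of relevant documents, it can stay low even when the first ten results look acceptable, which matches the pattern described in this section.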

Table 2. Experiment results with stemming

                Full Index   Incremental   Search      Average     Precision    Precision
                Time(sec)    Time          Time(sec)   Precision   at R(30%)    at 10 docs
    FBIS3       [missing]s                 2s          [missing]   [missing]    [missing]
    FBIS3+FBIS4 483s         217.8s        3.4s        [missing]   [missing]    [missing]

There was a problem in our experiment. In Figure 1, see columns 9 and 18. We use a module named RetEval, which has several parameters that need tuning. The parameters are shown in Figure 9.

5. Conclusion

This is the first time we have used the Lemur system to build an IR system, and we encountered many problems while using the toolkit. Midway through, we considered building the system on our own. However, the Lemur toolkit supports the construction of basic text retrieval systems using language modeling methods, as well as traditional methods such as those based on the vector space model and Okapi. As the toolkit evolves, it is expected to support research in a broader range of information technologies such as filtering, and even question answering. In a word, the toolkit is attractive enough that we decided to keep using it. Lemur has many applications for indexing and retrieval that are fully functional for many purposes, so we use them almost "out of the box". In addition, since Lemur was written to facilitate research on language modeling and IR, its design allows us to try out new retrieval methods by subclassing abstract interfaces, or to write new applications based on existing methods. This flexibility is also a big problem for us, because we do not clearly understand the parameters the toolkit defines. We tried many times to tune the parameters to get better results, but the behavior remained poor, so we searched for our problem in the toolkit's public forum. This forum is for users and developers of the Lemur toolkit to discuss the software and share tips on using Lemur, as well as to ask questions. The developers of the toolkit monitor the forum on a regular basis. In the forum, we found many reported problems with the toolkit. Some code in the Lemur toolkit is wrong, and we found the error reported in this forum. The performance is not as good as expected.

6. Appendix

Figure 1. Flow Overview
Figure 2. Process of parsing query
Figure 3. The parameter of indexing FBIS3
Figure 4. The parameter of incremental indexing FBIS4
Figure 5. The parameter of indexing FBIS3 + FBIS4
Figure 6. The retrieval of FBIS3
Figure 7. The retrieval of FBIS3 + FBIS4
Figure 8. Selection in Retrieval model

7. Reference

[1] CMU Lemur
[2] Information Retrieval Data Structures & Algorithms
