Contents

1. INTRODUCTION
2. WHAT IS INFORMATION RETRIEVAL?
   2.1 FIRST: A DEFINITION
   2.2 HISTORY
   2.3 THE RISE OF COMPUTER TECHNOLOGY
   2.4 DATA RETRIEVAL VERSUS INFORMATION RETRIEVAL
3. CONCEPTS AND DEFINITIONS WITHIN THE FIELD OF INFORMATION RETRIEVAL
   3.1 RECALL AND PRECISION
       3.1.1 Recall
       3.1.2 Precision
   3.2 TERM FREQUENCY AND INVERSE DOCUMENT FREQUENCY
       3.2.1 Term frequency (TF)
       3.2.2 Inverse document frequency (IDF)
   3.3 DOCUMENT PRE-PROCESSING
       3.3.1 Lexical analysis
       3.3.2 Elimination of stop-words
       3.3.3 Stemming
       3.3.4 Selection of index-terms
       3.3.5 Construction of term and document categorisation
   3.4 THE INDEXING PROCESS
       3.4.1 Inverted files
       3.4.2 Suffix trees
       3.4.3 Signature files
   3.5 SEARCHING: STEPS TO BE TAKEN
   3.6 SEARCHING: MATCHING AND RANKING
       3.6.1 Boolean matching
       3.6.2 Vector-based matching
       3.6.3 Probabilistic matching
       3.6.4 Fuzzy matching
   3.7 SIMILARITY- AND WEIGHT-FUNCTIONS THAT CAN BE USED
       3.7.1 Tf-Idf weighting
       3.7.2 Signal-noise ratio
       3.7.3 Term discrimination value
4. IMPLEMENTATION
   4.1 OBJECTIVE
   4.2 ENVIRONMENTAL ISSUES
       4.2.1 Hardware
       4.2.2 Software
   4.3 DOCUMENT COLLECTION PROPERTIES
       4.3.1 Document structure
       4.3.2 Language
       4.3.3 Topic area
       4.3.4 Size of collection
       4.3.5 Dynamics
   4.4 INDEXING AND RELATED ISSUES
       4.4.1 Implementation
       4.4.2 Evaluation
       4.4.3 Possible improvements
   4.5 RETRIEVAL AND RELATED ISSUES
       4.5.1 Implementation
       4.5.2 Evaluation
       4.5.3 Possible improvements
   4.6 TESTING AND RESULTS
       4.6.1 Strategy for testing
       4.6.2 Results when stemming is not used
       4.6.3 Results when stemming is used
       4.6.4 General observations
5. EVALUATION ISSUES
   5.1 PROBLEMS ENCOUNTERED
       5.1.1 Test-data
       5.1.2 Domain expertise
       5.1.3 Phrase searching implementation
6. CONCLUSION
7. BIBLIOGRAPHY

1. Introduction

The science of information retrieval can nowadays be seen as one of the most important technologies for finding the textual information we need within a structure that we can search and analyze. Since the explosion of digital texts through the World Wide Web, this information has been stored and spread over billions of documents, increasing the demand for smart retrieval systems. This 'smartness' of information retrieval systems can be achieved through numerous techniques. These techniques should interpret data in such a way that the returned documents are something more than just the result of a simple data-comparison between query and document collection. The idea is that a query represents a certain need for information rather than merely data, so the answer to the query has to reflect this need.

In this thesis I will explain several techniques through which retrieval of data can be enhanced towards retrieval of information. My main focus, however, will be on the specific technique of stemming. This technique makes use of the fact that most of the words in a text are derived from a base stem. In order to make sure that documents related to a query are retrieved even though the actual word searched for (for example: connections) may not explicitly occur within a text, the system stems the query-words and the words in the vocabulary, so that query and vocabulary only contain stems of words. In this example, connect would be the word the query is transformed to, and all words derived from connect (connections, connectivity, connecting) would be represented by the stemmed word connect in the vocabulary. Through this method one can achieve a higher number of retrieved documents. Additionally, some sort of semantic clustering takes place, since most words derived from a base stem can be considered semantically correlated to that base stem.

Of course this method has a negative tradeoff, and this tradeoff concerns the notion of precision. That is, if one generalises specific words into base stems, it is harder to get a precise answer to a query. Hence, when searching for connections, a document that contains the word connectivity twice will score equally to a document containing the search-word connections twice. In this way, precision is lost with respect to what is actually searched for: one would expect the document explicitly containing the word connections to score higher (which it does not in this case). I am interested in what specific advantages and disadvantages this tradeoff brings about.

To test the use of stemming, an implementation of a basic search engine based on inverted files will be discussed. The document collection that will be searched is that of the USA-Web, located at http://www.let.rug.nl/~usa. This website contains HTML documents on the history of the United States of America from the colonial period until present times, and is at the moment of writing a relatively small collection of about 3500 documents in the English language.

First (chapter 2), I will describe several distinct qualities of an IR system and provide a definition of Information Retrieval. After this description, in chapter 3, a summary of the basic concepts and ideas within the field of Information Retrieval is given. This is done in order to give an informative background of the field, so that one will feel comfortable with the several aspects discussed in this thesis.
Also, information on different techniques is given in order to provide an understanding of my choice for an inverted-file based IR system as opposed to vector-based or probabilistic matching. In chapter 4, the implementation will be discussed, followed by an evaluation of results obtained through this implementation (chapter 5). Finally, I will give a conclusion (chapter 6) that should give a satisfying answer to the question: 'What advantages and disadvantages can be seen with respect to the use of a stemming-algorithm in an Information Retrieval system applied to a relatively small single-language document-collection?'

2. What is Information Retrieval?

2.1 First: a definition

For a good formal definition of Information Retrieval I would like to quote the definition given in Baeza-Yates & Ribeiro-Neto (1990: p.1): 'Information Retrieval deals with the representation, storage, organisation of, and access to information items. The organisation and access of information items should provide the user with easy access to the information in which he is interested.' In the next paragraphs I will briefly explain why this definition incorporates all of the important features of a good Information Retrieval system.

2.2 History

The history of Information Retrieval goes back as far as the start of paper archiving, in an environment where the maintainers of a collection of information written on paper were no longer able to store all necessary information about the collection in their memories. After 1940 this situation became more and more critical. A discussion therefore started about how to store and retrieve documents in a document collection where the retrieval part did not mean going through all the text available in that collection (for that had become too time-consuming a task because of the size of the document collection). The idea was that efforts should be made to store and retrieve documents from a system representing the complete document collection as well as possible, where a search-query on that representative system would not be as costly as scanning through all the text within the document collection itself.

The first systems that tried to do this were (naturally) designed in libraries, where the problem of large document collections first arose. These systems were paper-based themselves (i.e. no computer-based systems were available or even under construction at the time). They relied on extensive categorisation using a system of cards, where on each card a short description in key-words or natural language was printed that would describe the document the card represented as precisely as possible. These document cards were then categorised into more general categories in a recursive process until the top categorisation level was reached. When searching for a document, one would first decide in which general top-category to begin, followed by travelling down the tree-structure into more specific categories until the cards of the documents in the category best specifying the query of the user were retrieved. In the ideal case, the cards found under that category would exactly represent the subset of documents in the collection the user was interested in. In many cases, however, the classification would also lead to erroneous retrieval, that is, the retrieval of documents the user was not interested in.

2.3 The rise of computer technology

A lot of research has been done since then, starting with improving the paper-based systems and designing all kinds of algorithms and statistical tests for measuring the performance of the systems, while along the way computers emerged as useful helpers. The gradual improvement of the performance of computers shifted the focus in time more towards implementing the algorithms and designing computer programs focussed on automating the systems for storing and retrieving information.
2.4 Data Retrieval versus Information Retrieval

Accompanying the progress made in the computer hardware industry, the focus moved from the retrieval of data, by matching the words in a query with those in the document collection, towards the extraction of information. Here the distinction between data retrieval and information retrieval became important.

In a data retrieval approach one matches the words entered by the user to the occurrence of these words in the documents searched. Only documents that contain these exact words will be returned. The returned result-set is unordered, so nothing can be said about the importance of a certain document in comparison to another. Also, these results in no way guarantee any semantic relevance of the returned documents with respect to what the user is searching for, except for the fact that these documents contain the words (data) the user is interested in. The fact that the words themselves can have more meanings than one (ambiguity) is not accounted for. So the results will, on a data level, always match the search request, but due to ambiguity and a lack of any semantic knowledge about the data, the information in the result set may not match the user's interest.

An information retrieval system elaborates on the conceptual question a user has, and tries to answer this question by supplying the information most relevant to the user's information need. The techniques that perform this task try to interpret the query as a real question (a user wants to know something about a certain subject) instead of an order to retrieve specific data. After this interpretation the system can expand and/or modify the user's request, so that it weighs the data and can therefore return documents weighed according to their presumed relevance to the user's query. There are different approaches to how one can weigh data and expand queries, of which I will discuss the most common techniques in the next chapter. The main advantage of the focus on information retrieval versus data retrieval is that a more informative result, relevant to the user, can be given instead of only physical matches to the words in the query. The notion of relevance is therefore the key concept when dealing with the retrieval of information.

3. Concepts and definitions within the field of Information Retrieval

Since there are quite a number of specific concepts and definitions within the field of IR, this chapter focuses on providing some basic knowledge about the meaning and use of different key concepts.

3.1 Recall and Precision

When rating an IR system, one has to have some guidelines as to what is important in measuring the overall performance of the system. Besides the trade-off between the time, space and effort it takes to build an index (designer) and to query it (designer and user), one has to be aware of what is important to the user, namely: what kind of results are satisfactory? This is where two terms are introduced: recall and precision.

3.1.1 Recall

The notion of recall can be viewed as the ability of the system to retrieve all relevant items matching a query. It can be computed as follows:

    R = RetRel / TotRel

where:
    R      : the recall ratio
    RetRel : the total number of items that are both retrieved and relevant
    TotRel : the total number of relevant items in the collection

[Fig. 3.1: a diagram of the document collection showing the set of relevant items, the set of retrieved items, and their intersection (the items that are both relevant and retrieved).]
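As a concrete illustration of this measure (and, anticipating paragraph 3.1.2, of precision as well), the following sketch computes both ratios from sets of document identifiers. The document IDs and the function name are hypothetical and serve only as an example:

    def recall_precision(retrieved, relevant):
        # retrieved: set of document IDs returned by the system
        # relevant:  set of document IDs that are actually relevant
        ret_rel = len(retrieved & relevant)   # retrieved AND relevant
        recall = ret_rel / len(relevant)      # RetRel / TotRel
        precision = ret_rel / len(retrieved)  # RetRel / TotRet
        return recall, precision

    # Hypothetical example: 3 of the 4 relevant documents were retrieved,
    # and 3 of the 5 retrieved documents are relevant.
    r, p = recall_precision({1, 2, 3, 7, 9}, {1, 2, 3, 4})
    print(r, p)  # 0.75 0.6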

If the system retrieves all the documents from the collection that are relevant to the query, the recall ratio is 1. Note that relevancy is a somewhat vague term: for the same query, one user might find a certain document relevant whereas another user might not. This makes it very hard to determine relevancy without using statistical measures over a group of users.

3.1.2 Precision

Precision is a ratio that compares the number of relevant documents found to the total number of returned documents. It can be computed as follows:

    P = RetRel / TotRet

where:
    P      : the precision ratio
    RetRel : the total number of items that are both retrieved and relevant
    TotRet : the total number of retrieved items

See Fig. 3.1 for a visualisation of the terms used. Thus, when the result of a certain query contains only items relevant to this query, the precision ratio is 1. Again, the problem with relevancy is that it is not a fixed value that can be measured; it can be different for anyone using the system.

3.2 Term frequency and Inverse Document Frequency

In the following paragraphs, term frequency (TF) and inverse document frequency (IDF) are frequently mentioned. I will explain here what they are.

3.2.1 Term frequency (TF)

The tf-factor is the relative frequency of a term in a document. This relative frequency is a measure of how well a certain term describes a document (intra-document characterisation): if a term occurs frequently in a certain document, we assume that the term has some significant relation to the subject of that document. It can be computed as follows. Let Freq_{i,j} be the raw frequency of term K_i in document D_j. Then the normalised frequency F_{i,j} of term K_i in document D_j is given by:

    F_{i,j} = Freq_{i,j} / \sum_{l=1}^{n} Freq_{l,j}

where n is the number of unique terms (tokens) in document D_j.

3.2.2 Inverse document frequency (IDF)

The idf-factor is the inverse of the frequency of a term measured over all the documents in the collection. This value is a measure of how good a discriminator a term is (inter-document characterisation) for determining the relevance of a document. Even though a term might occur frequently in a certain document, if it occurs frequently in all documents then it is not very useful for distinguishing a relevant document from a non-relevant one. One can imagine that in a document collection about the history of the United States of America, the word 'america' occurs frequently in each document. The term may, in this case, indicate something about the subject of the document, but since all documents (partly) share this subject, it does not add any value to our discriminating process. It can be computed as follows. Let N be the total number of documents in the system and N_i the number of documents in which the index term K_i appears. Then:

    Idf_i = \log N - \log N_i

3.3 Document pre-processing

When indexing a document collection, one could assume that all the words in the documents are relevant as index-terms. For a lot of words, however, this is not the case. Articles, for instance, are so frequent in natural language that they will hardly distinguish one document from another in the search process. For example, searching for the word 'the' will probably retrieve the complete document collection. This is not desirable, and therefore these frequent words, also called stop-words, are better left out of the index, since they don't have any discriminating value. Also, consider punctuation characters: should they be included in the index? For example, if a query is formed as 'the president's wife', do you want to search for "president's" as an index-entry, or would a search for 'president' yield the better answer? The first will result in an exact match of what the user asked, but isn't the user actually interested in co-occurrences of 'president' and 'wife'? The second returns some more documents concerning the words 'wife' and 'president', even though the exact phrase "president's wife" might not be in the result-set. So recall is higher, but at the cost of precision.

So, instead of indexing all the words and characters as they occur within the documents of the document collection, it is highly recommended that the text be pre-processed in order to determine which words and characters should be indexed and which should not. Pre-processing has several steps that can be considered. Not all of the steps are necessary for all systems, since the algorithms performing the tasks might reduce the speed of indexing and/or retrieval. Below I give a summary, together with a brief description, of the steps that can be taken. With each step I will argue whether the algorithm is useful with respect to the IR system I am creating, taking into account the size of the document collection it is intended to serve.

3.3.1 Lexical analysis

Lexical analysis of the texts that are to be indexed involves the identification of which characters are to be indexed and of what determines the boundaries of the terms. For instance, if an IR system only has to handle word queries, one would want to identify just the words within a document. If phrases are to be indexed too, however, other boundaries might have to be used in order to identify phrases instead of words. Also, the treatment of special characters such as hyphens, digits, punctuation marks and spaces needs to be carefully considered. In my implementation I have chosen not to index punctuation characters, since they do not seem relevant to the subject of the history of the United States (the field of scope of my document collection), but I have indexed numbers, for historical dates are very likely to be of importance to users.
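A minimal sketch of such a lexical analyser is given below. It follows the choices just described (runs of letters or digits form terms, punctuation is discarded); the regular expression and the function name are my own illustration, not part of the implementation discussed in chapter 4:

    import re

    # Terms are maximal runs of letters or digits; everything else
    # (punctuation, whitespace) is treated as a boundary and discarded.
    TERM = re.compile(r"[a-zA-Z0-9]+")

    def tokenize(text):
        return [t.lower() for t in TERM.findall(text)]

    print(tokenize("The president's wife, in 1861."))
    # ['the', 'president', 's', 'wife', 'in', '1861']

Note that the apostrophe acts as a boundary, so "president's" yields the terms 'president' and 's', while the date 1861 is kept as a term.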
3.3.2 Elimination of stop-words

Words like 'the' and 'in' are likely to occur in almost all the documents in the document collection (if the texts are in English, of course) and can therefore hardly provide for a distinction between documents concerning relevance. The best way to avoid retrieving too many documents that do not particularly match the user's query is to filter out these words (also called stop-words). One way to do this is to use the relative frequencies of the words within the documents as a threshold value for determining whether a word has significant meaning for a document's subject or not. The natural candidates for a list of stop-words are articles, prepositions and conjunctions. An additional positive side-effect of filtering out the stop-words is that it reduces the size of the indexing structure considerably, by amounts that can rise up to 40 percent of its initial size.

Since the document collection I am working on is relatively small, I have decided not to make use of the elimination of stop-words. Also, it may prove interesting to see what happens if I compare a technique such as tf-idf (to be explained in detail in paragraph 3.7.1), which incorporates the distribution of terms over the document collection (hereby typically lowering the weights of stop-words in a query), to a technique that does not.

3.3.3 Stemming

Because certain words in the texts are derived from base words, some documents that contain information relevant to the user's query might not be found, because the exact word the user entered in the query is not literally present in the document. The word entered might, however, be a derivation of a base word that is present in the documents not found initially. For instance, a query that contains the word 'industrialization' will, without stemming, only return documents actually containing this specific word, whereas documents containing words such as 'industry' or 'industrial' will not match, even though they might very well contain interesting information about industrialization. Conversely, if the user enters a base word in his or her query, one can safely assume that derivations of this word in documents will also increase the relevance of these documents for the user. For instance, if a query contains the word 'industry', it is very likely that documents containing words such as 'industrialization' and 'industrial' are also of interest to the user, whereas in a non-stemmed system these documents will not be returned.

In order to overcome both these inefficiencies, the usage of a stemming algorithm can improve results significantly. A typical stemming algorithm removes prefixes and suffixes from the words that are to be indexed in order to get to the base word they are derived from. This base word is then indexed, and thus, within the index, this word points to all occurrences of the base word and all of its derivations in the documents. By using this technique one can find documents related to a query instead of only the exact matches, hereby ideally improving recall. The improvement of recall usually comes at the cost of a certain amount of precision. This is mainly because more documents are retrieved, while the query is interpreted more generally by reducing query-words to their base words. With the use of a stemming algorithm one also reduces the size of the vocabulary and thus the size of the index, which is an additional advantage, especially on larger document collections. The algorithm most widely used for the task of stemming is the Porter stemming algorithm (Baeza-Yates & Ribeiro-Neto (1990: p.433)). This is also the one I have used in my implementation in order to perform the research with respect to my thesis question (see chapter 1).
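The Porter algorithm itself is too long to reproduce here, but its effect can be illustrated with an off-the-shelf implementation. The sketch below uses the one from the NLTK library; the choice of that library is an assumption of this example, not a dependency of the implementation discussed in chapter 4:

    # Requires the NLTK library (pip install nltk).
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["connections", "connecting", "connectivity",
                 "industry", "industrial", "industrialization"]:
        print(word, "->", stemmer.stem(word))

With this implementation, the three connect* forms all reduce to the stem 'connect' and the three industr* forms to 'industri', so queries and documents using any of the variants match the same vocabulary entry.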
3.3.4 Selection of index-terms

When determining the importance of certain words within a text, one can use different kinds of measures for which words are important factors in describing the semantic content of a document well. This determination of index-words can be done manually, automatically, or through a combination of both.

When manually determining the importance of words within texts, it is usually the case that someone who is specialised in the field the documents are about determines which words should be indexed and which not. This in itself is already a time-consuming task, but on top of this, when new texts are added, it needs to be re-performed every time the scope of the document collection is expanded with new concepts and therefore new possible index-terms. Thus, there is a need for automation.

Automatically determining the value of words as index terms can be done using the syntax and the semantics of the text. A technique that has proven to give good results is the identification of noun groups (Baeza-Yates & Ribeiro-Neto (1990: p.169)). This technique involves stripping the text of words that are not nouns. The idea behind this is that nouns convey most of the semantics of a text, as opposed to word types such as verbs, adjectives and articles. Instead of simply using these nouns as index terms, another process takes place beforehand. While stripping the non-noun words, we count the number of non-noun words that exist between two nouns. By defining a threshold (for instance, at most 3 words may exist between two nouns) we can combine nouns that co-occur into one noun-group, which in the end will be indexed as a relevant set of index-terms that represents a conceptual logical view of the documents. Since this technique is based more on a syntactic and semantic approach, whereas I am focussing more on the probabilistic ways of indexing, I have chosen to select index-terms neither manually nor automatically in the sense described above. Instead, I will use frequency information as a measure for term importance.

3.3.5 Construction of term and document categorisation

The construction of term categorisation is done for the purpose of being able to show all documents related to a query instead of only exact matches. By creating categories into which words can be sorted, the search mechanism can expand the query the user feeds into the system into a broader query in order to find related documents. Two ways of categorisation are widely used at the moment: one is the construction of a thesaurus and the other is the use of document clustering.

3.3.5.1 Thesaurus

A thesaurus is a list of words that are considered the most important or significant within a certain domain of knowledge; onto each word in this list, a set of related words is mapped. A thesaurus can be created manually. In that case it has to be done by someone who is an expert in the field of knowledge. This person decides which words should be used as index-words and in which category these words belong. When a search is done, the system checks in which category the query-terms occur, and is then capable of returning documents which contain terms within the same category as the query-terms. The manual process of determining which words should be indexed, and in which category they belong, is however a very time-consuming task, especially when the document collection is dynamic and the content has to be updated frequently.

3.3.5.2 Document clustering

Document clustering is a method that focuses on identifying similarities between the documents within a document collection, based mostly on the importance of certain terms for certain conceptual classes of documents. A distinction is made between local clustering methods and global clustering methods.

Local clustering means that clustering is performed on the outcome of a query. The outcome is the subset of the document collection that matches a certain user query. This method is used a lot in relevance feedback cycles, where one finds the most relevant terms by focussing on the documents marked as relevant to the query by the user. For instance, if a query such as 'personal history of president Carter' is entered, documents will be returned which contain one, some or all of the words in the query. However, there may be many documents that say a lot about president Carter's political life but have nothing to do with his personal life. The user can now browse this result set and mark the documents that are about the president's personal life. The system will then be able to adjust the user query and add terms to it that best describe the content of the documents the user marked as relevant. This new query is then re-submitted to the system, resulting in a new set of matching documents. On this new set of documents, the user can again supply information on which documents are most relevant. This cycle can go on until the user is satisfied with the documents returned. With every cycle, the relevance of the documents should improve. Because of query-expansion, recall does not have to suffer from the search for higher relevance.
Global clustering, instead of focusing on a subset of the document set, uses the complete document collection for clustering purposes. This technique is in most cases realised by creating a vector for each document within the collection, after which the vectors are compared to one another and a degree of similarity can be computed. The degree of similarity then clusters documents into classes that should resemble certain subjects or fields of information. These global clusters can be used to present documents related to the documents retrieved by an initial query of a user. Ideally, this would provide the user with more information about the subject searched for, and thus with information that was not found initially according to the query.

Both techniques have proven to improve precision significantly, especially when they are combined. However, classification of documents, globally and locally, is a time-consuming task. For global classification the vectors need to be recomputed every time a new document is added to the collection. Local classification perhaps provides a more precise answer to a query in the end, but the computing and comparing of the vectors during search time is also a factor that slows down the search process. Since both of the above-mentioned techniques imply a substantial increase in search-time and/or indexing-time, I have chosen not to use them in my implementation.

3.4 The indexing process

The process of indexing is a fundamental part of the creation of an IR system. Indexing means that instead of seeking matches in a document collection by searching linearly through the original documents, we try to create a system that allows searching through a representation of this document collection. This representation is then called our index. The query submitted to the system should be in the same format as the index in order to be able to determine the matches between the query and the document set. This doesn't mean that the user has to formulate the query in this format: it is common that after a user submits the query to the system, it is converted into the desired format internally. After this conversion the reformatted query is compared to the index, and a result set of matching documents is retrieved and returned to the user in a user-friendly format.

An index can be created using various distinct techniques, of which I will only explain the ones most commonly used at the moment, which have proven to deliver the best results. These are: inverted files, suffix trees and signature files. A short description of these techniques is given below, following paragraphs 8.2 and 8.3 of Baeza-Yates & Ribeiro-Neto (1990: p.203). While doing this, I will also briefly explain my choice for inverted files as the technique to be used for my IR system.

3.4.1 Inverted files

Perhaps the most commonly used method for indexing is the use of inverted files. Inverted files are files that use a vocabulary over all documents in our collection, in combination with pointers to the positions within the documents in which the words of the vocabulary occur (see Fig. 3.2). In this way one can search for a queried word in the vocabulary and, when it is found, use the list of pointers associated with this word to return the result-set of documents in which the word occurs. In combination with Boolean operators one can combine these result-sets for each individual word in the query into a final (as yet unsorted) result-set. By using ranking algorithms one can measure how well certain documents match a query. Given this information, the documents in the result-set can be ranked according to their relevance, so that the document best matching the user's query is shown first in the list. The different ranking algorithms and techniques that exist will be discussed in paragraphs 3.6 and 3.7.

[Fig. 3.2: a vocabulary (a, an, border, broken, common, ...) in which each entry points to its list of documents in the index, e.g. a -> Doc1;Doc2, an -> Doc2;Doc3, border -> Doc2;Doc4;Doc6, broken -> Doc3;Doc5.]
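As a sketch of this structure, the following builds a small in-memory inverted index (document identifiers only, without in-document positions) and looks up two words; the miniature collection is made up for the example:

    from collections import defaultdict

    def build_index(docs):
        # Map each word of the vocabulary to the sorted list of
        # documents in which it occurs.
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for word in text.lower().split():
                index[word].add(doc_id)
        return {word: sorted(ids) for word, ids in index.items()}

    docs = {
        "Doc1": "a treaty",
        "Doc2": "a border an agreement",
        "Doc3": "an agreement broken",
    }
    index = build_index(docs)
    print(index["a"])   # ['Doc1', 'Doc2']
    print(index["an"])  # ['Doc2', 'Doc3']

Combining two such posting lists with set intersection or union then implements the Boolean AND and OR operations mentioned above.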

Inverted indices have the advantages of relatively easy implementation and high search speed. These two advantages, however, only hold for word searching in texts; for phrase and suffix searching this method has proven less efficient than other methods.

3.4.2 Suffix Trees

Using suffix trees is another way of indexing textual documents. The closely related suffix array structure was first presented in Manber & Myers (1990: pp.319-327). A good explanation of the use of this technique is given in Baeza-Yates & Ribeiro-Neto (1990: p.199):

'This structure can be used to index only words (without stopwords) as the inverted index, as well as to index any text character. This makes it suitable for a wider spectrum of applications, such as genetic databases. However, for word-based applications, inverted files perform better unless complex queries are an important issue. This index sees the text as one long string. Each position in the text is considered as a text suffix (i.e. a string that goes from that text position to the end of the text). It is not difficult to see that two suffixes starting at different positions are lexicographically different (assume that a character smaller than all the rest is placed at the end of the text). Each suffix is thus uniquely identified by its position. Not all text positions need to be indexed. Index points are selected from the text, which point to the beginning of the text positions that will be retrievable. For instance, it is possible to index only word beginnings to have a functionality similar to inverted indices. Those elements which are not index points are not retrievable (as in an inverted index it is not possible to retrieve the middle of a word).'

See Fig. 3.3 (taken from Baeza-Yates & Ribeiro-Neto (1990: p.200)) for an illustration.

[Fig. 3.3: the sample text 'This is a text. A text has many words. Words are made from letters.' with the suffixes starting at the word beginnings ('text. A text has many words. ...', 'text has many words. ...', 'many words. ...', 'words. Words are made from letters.', 'Words are made from letters.', 'made from letters.', 'letters.') marked as index points.]

If we index only word beginnings, we have functionality similar to inverted files.

This technique has the advantage that it can index complete words (as inverted files do) but is also capable of indexing any single character in the text. This feature makes it suitable for applications such as genetic databases, and it also performs better at search time when answering complex queries (such as phrase searching), as opposed to inverted files, which do not have this advantage.

The suffix tree can be composed using a trie data structure or an array data structure (see Fig. 3.4). The trie structure has as its disadvantage that the space it occupies lies between 120% and 240% overhead over the text size, and it is thus not very suitable if you are working on large text collections. To improve space utilisation, this data structure can be compacted into a Patricia tree (more information on Patricia trees can be found in Gonnet (1987) under the name of PAT-arrays). Compacting the suffix trie into a Patricia tree (see the second illustration in Fig. 3.4) involves compressing all unary paths, i.e. paths where each node has just one child. One can see the result of compressing these paths in figure 3.4: since there is only one word starting with a 't' (namely 'text'), only the part of the suffix that uniquely identifies it (the last character, in this case) is indexed after the 't'. This saves considerable amounts of space.

The technique using suffix arrays stores all pointers to the text suffixes in an array, and its space requirements are about the same as those of inverted indices (i.e. close to 40% overhead over the text size). Although the suffix array can achieve about the same search times as inverted indices, and has the advantage of not degrading in search performance on phrases (complex queries), the building time of this structure is about 5 to 10 times longer than that of inverted files. Suffix arrays occupy a more limited amount of space than the trie-structured approach and are therefore the most widely used technique with respect to suffix searching.

Since complex queries and phrase searching are not my main focus, and the cost of building the index is considerably higher (i.e. it takes more space (suffix tries) or time (suffix arrays)) than the cost of inverted indexes, the techniques using suffix indexing and searching, interesting as they are, do not outperform the technique of inverted files in my comparison of what technique to use for the indexing task at hand.

[Fig. 3.4: the sample text 'This is a text. A text has many words. Words are made from letters.' with character positions 1, 6, 9, 11, 17, 19, 24, 28, 33, 40, 46, 50, 55 and 60 marked; a suffix trie over the word-beginning suffixes; and the compacted Patricia tree (suffix tree) in which all unary paths have been compressed.]
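As a sketch of the array variant: a suffix array can be built by sorting the chosen index points by the suffix that starts there. The naive construction below (sorting full suffixes) is quadratic in the worst case and only meant to show the idea; production systems use specialised construction algorithms:

    def suffix_array(text, index_points):
        # Sort the given text positions by the suffix starting there.
        return sorted(index_points, key=lambda i: text[i:])

    text = "This is a text. A text has many words."
    # Index only word beginnings, as in Fig. 3.3 and Fig. 3.4.
    points = [0] + [i + 1 for i, c in enumerate(text) if c == " "]
    sa = suffix_array(text, points)
    # A binary search over `sa` now finds any word prefix, e.g. "text",
    # in O(log n) string comparisons.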

3.4.3 Signature files

Signature files are a way of indexing based on hashing techniques. By means of a hashing function one can compute a signature for each term in the vocabulary. We can then split the documents up into blocks, where for each block we compute a signature by disjunctively combining (OR-ing) the signatures of all terms occurring in that block. If a word occurs in a block of text, then we expect that all the bits that are set in the term-signature are also set in the block-signature (see Fig. 3.5, taken from Baeza-Yates & Ribeiro-Neto (1990: p.206)).

[Fig. 3.5: the sample text split into four blocks ('This is a text.', 'A text has many', 'words. Words are', 'made from letters.') with block signatures 000101, 110101, 100100 and 101101, produced by the term signatures H(text) = 000101, H(many) = 110000, H(words) = 100100, H(made) = 001100 and H(letters) = 100001.]

With respect to this technique, a query is also represented as a signature. If, following the signatures shown in Fig. 3.5, we enter a query with the word 'many', the query-signature will be 110000. This query-signature is matched against the block-signatures by bitwise AND-ing their values. The idea behind this is that a block matches the query if the outcome of the AND-operation on the block-signature and the query-signature has all the bits set that are set in the query-signature. For example, the search for 'many' results in the AND-operations with their respective outputs depicted in Fig. 3.6:

    Query: many        Query-signature: 110000

    Block1 -> 000101 AND 110000 = 000000
    Block2 -> 110101 AND 110000 = 110000
    Block3 -> 100100 AND 110000 = 100000
    Block4 -> 101101 AND 110000 = 100000

    Fig. 3.6

We can now compare the resulting bit-signatures with the original query-signature. The only matching block turns out to be Block2 (the only one in which both of the first two bits remain set). This is correct, since Block2 is indeed the only block that contains the word 'many'.

The space overhead of this indexing technique is between 10 and 20 percent of the text size. This is the most positive feature of signature files, especially for larger document collections. Opposed to this, however, is the fact that searching has to be done sequentially, which is an important disadvantage, especially when searching in larger document collections. Thus, if space is more important than searching speed, this method proves itself very worthy, but otherwise (which is mostly the case, since a short search time is usually rated as one of the more important features of an IR system) inverted files outperform it for most applications. Since the system I will discuss in the following chapters focuses on speed of retrieval, precision and recall rather than on storage space, this technique too is not favourable compared to inverted files.
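The sketch below reproduces this scheme with the hand-picked 6-bit term signatures from Fig. 3.5; a real system would derive them from a hash function:

    # Term signatures as in Fig. 3.5 (normally produced by hashing).
    H = {"text": 0b000101, "many": 0b110000, "words": 0b100100,
         "made": 0b001100, "letters": 0b100001}

    def block_signature(words):
        # OR together the signatures of all indexed terms in a block.
        sig = 0
        for w in words:
            sig |= H.get(w, 0)
        return sig

    blocks = [["this", "is", "a", "text"], ["a", "text", "has", "many"],
              ["words", "words", "are"], ["made", "from", "letters"]]
    sigs = [block_signature(b) for b in blocks]

    query = H["many"]
    matches = [i + 1 for i, s in enumerate(sigs) if s & query == query]
    print(matches)  # [2] -- only Block2 can contain "many"

Note that OR-ing signatures can produce false matches (a block signature may cover the query signature by accident), so matching blocks still have to be verified against the actual text.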

3.5 Searching: steps to be taken

The searching process depends on the type of index created. Since I have chosen to use inverted files, I will limit my discussion to search mechanisms that operate on inverted files. A classical inverted file system contains two files: the vocabulary file and the inverted file itself. The vocabulary file, which is sorted alphabetically, is used to perform a binary search for the occurrences of the words extracted from the query. The words in the vocabulary file are typically mapped to the position of the corresponding entry for the word in the inverted file. The reason it is done this way is that the vocabulary, even for large text collections, can often be stored in main memory and therefore provides a powerful, fast initial searching process. The following steps are used (see figure 3.7, extended from fig. 3.2) to get from the query to the result set of matching documents:

Vocabulary search: the words in the query are isolated and searched for in the vocabulary.

Retrieval of occurrences: the mappings found in the vocabulary are used to extract from the inverted file all of the documents that match the individual words, together with possible discriminator values computed for each word within the document or document collection.

Manipulation of occurrences: the individual occurrences of the words in the query are combined to solve phrase, proximity or Boolean operations (only Boolean operations are shown in the figure). After this process one set remains, containing the documents that match the query.

    Query: a AND border

    Vocabulary: a, an, border, broken, common, ...
    Index:      a      -> Doc1;Doc2
                an     -> Doc2;Doc3
                border -> Doc2;Doc4;Doc6
                broken -> Doc3;Doc5

    Manipulation of occurrences:
        <Doc1;Doc2> AND <Doc2;Doc4;Doc6>
        Result: <Doc2>

    Fig. 3.7

Here, Document 2 is the only document in which both terms occur; it is therefore returned to the user as matching the query.

Once these steps are carried out, we have a set containing the documents which completely or partially match the query. But how can we express the distinctions between these documents in terms of relevance? This question has led to many observations and implementations of possible ranking algorithms. To understand and make a good choice between these ranking algorithms, I will describe the algorithms most common in the field and elaborate on their advantages and disadvantages with respect to the task to be performed.

3.6 Searching: matching and ranking

When querying an information retrieval system, one can choose from different strategies for performing the search on the document collection. Below I describe some of the retrieval techniques currently known.

3.6.1 Boolean matching

When retrieval systems were first devised for larger document collections, this was done through punched cards and edge-notched cards that only held information in the form of bit representations. With this representation one could, for instance, make a signature card for each document and match a query against the collection of signatures through Boolean connectives such as AND, OR and NOT. This system of retrieval is called the Boolean method of retrieval and is based on set theory. Many systems still rely on this Boolean system (now using electronics instead of cards), but improvements, adjustments and additions have been made, because a Boolean system alone does not provide for good ranking.

The Boolean technique draws a hard line between documents that are relevant and documents that are non-relevant to a query (either the document contains one or several of the words mentioned in the query, or it doesn't). This makes it sensitive to mistakes in the form of not selecting documents which are conceptually equivalent to the query but which do not contain the specific words mentioned in it. Also, when using the basic form of Boolean retrieval, no ranking is possible between the documents that match the query if they contain the same number of query words. This means that if two documents both contain two words from the query, they both qualify for being retrieved, but which one should be shown to the user with the higher priority? For these and other reasons, the Boolean system has been combined with other techniques in order to get the best of both worlds. This combination with other techniques is called the extended Boolean model. Although the Boolean technique in its basic form has been criticised since its inception, it is still widely used because of its ease of implementation and its simplicity in logic. Since combinations with probabilistic and weighting techniques have been tested, the extended Boolean technique is still one of the major techniques used and can compete against (or better: be combined with) other techniques such as proximity, fuzzy, probabilistic and vector-based matching.

I will implement an OR-based Boolean system as a basis for ranking my results. This means that, with respect to the query, all matches to the separate terms occurring in the query will be OR-ed together to form a final set of (partly) matching documents.
In addition to this, I will use extended Boolean techniques that also incorporate term frequency (see paragraph 3.2.1) and inverse document frequency (see paragraph 3.2.2) values in order to decide the relevance of terms to documents, and thereby the relevance of certain documents to the query containing these terms. These techniques also account for a higher ranking of documents that contain all query-terms, as opposed to documents that contain only a few, one or none of the query-terms.
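A minimal sketch of such an OR-based system with tf-idf term weights is given below. It illustrates the general approach only, not the implementation evaluated in chapter 4; a document's score is simply the sum of the tf-idf weights of the query terms it contains, so documents matching more, or rarer, query terms tend to score higher:

    import math
    from collections import Counter

    def rank(query_terms, docs):
        # OR-based retrieval: every document containing at least one
        # query term is returned, ranked by the sum of the tf-idf
        # weights of the query terms it contains.
        # docs: doc_id -> list of (already pre-processed) terms.
        n = len(docs)
        df = Counter(t for terms in docs.values() for t in set(terms))
        scores = {}
        for doc_id, terms in docs.items():
            tf = Counter(terms)
            s = sum((tf[t] / len(terms)) * math.log(n / df[t])  # F_ij * Idf_i
                    for t in query_terms if t in tf)
            if s > 0:
                scores[doc_id] = s
        return sorted(scores.items(), key=lambda kv: -kv[1])

    docs = {"Doc1": ["a", "treaty"],
            "Doc2": ["a", "border", "an", "agreement"],
            "Doc3": ["an", "agreement", "broken"]}
    print(rank(["a", "border"], docs))
    # Doc2 ranks above Doc1: it matches both query terms, and "border"
    # is the rarer (more discriminating) of the two.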

3.6.2 Vector-based matching

As the Boolean technique alone showed some deficiencies after its inception and use in the early days of Information Retrieval, alternative methods of retrieval were developed. One of the first successful alternative systems was the SMART system, described in Lesk (1964). It was the first system showing the results of successful vector-based matching. There are various distinct techniques for implementing vector-based matching, and these techniques can internally make use of different similarity functions that discriminate between matching and non-matching documents (that is: how similar is a document to a query?). Two concepts can be recognised: vector-based matching through metrics, and vector-based matching through an angular measure. Both concepts are described below.

3.6.2.1 Metrics- or distance-based

If you perform vector-based matching through metrics, you measure similarity by computing the distance between the documents and the query in the document space (see Fig. 3.8). If two documents have a similarity distance of 0, the documents are equal (the distance of a document to itself is always 0). Hence, the closer the distance between a document and a query in the document space, the more likely it is that this document matches the query well.

[Fig. 3.8: the document space in a metrics-based system. Arrows represent the absolute distance (on whatever scale) between the documents D1..D4 and the query Q1. There is no fixed origin, so the arrows are not angular vectors but pure metric distance-indicators.]

3.6.2.2 Angle-based

Another way of looking at vector-based matching is to look at the angles of the document vectors and the query vector in the document space with respect to a certain origin (as illustrated in Fig. 3.9). This method is called an angular measure, and one widely used example of this kind of measure is the cosine measure (as described in Wilkinson & Hingston (1991: pp.202-210)). This measure calculates the cosine of the angle between the vectors (with respect to the origin) representing the document and the query (or two documents, for document-to-document similarity).

[Fig. 3.9: the document space in an angle-based system, with (x,y,z)-coordinates D1: (1, 2, 4), D2: (3, 7, 4), D3: (7, 4, 1), D4: (7, 1, 4) and Q1: (3, 5, 3). The arrows represent the vectors for the documents and the query, all originating from the origin. The physical locations of the documents and the query are not important, only the angle of their vectors with respect to the origin.]
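To make the cosine measure concrete, the sketch below computes the cosine similarity between the query Q1 and each document vector, using the coordinates shown in Fig. 3.9:

    import math

    def cosine(u, v):
        # Cosine of the angle between vectors u and v.
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))

    vectors = {"D1": (1, 2, 4), "D2": (3, 7, 4), "D3": (7, 4, 1),
               "D4": (7, 1, 4)}
    q1 = (3, 5, 3)
    for name, vec in sorted(vectors.items(),
                            key=lambda kv: -cosine(kv[1], q1)):
        # Documents are printed in descending similarity to Q1.
        print(name, round(cosine(vec, q1), 3))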

These measures (angle- and metric-based) analyse the document space differently and use different techniques to define the similarity between documents and the query. The distance measure looks at a group of documents and uses only the distance between the documents in document space, thereby discarding any fixed point onto which the values may be mapped. This makes the method intrinsic: it only focuses on the internal relations between the objects. That is, from a given point in the document space, all directions are considered equal. The angular measure, however, is extrinsic, because angles can only be computed from a fixed point. If you change the indexing process (by changing, for instance, the term-weighting scheme), the origin from which the angles are measured can change. The documents can thereby be assigned a different point in the document space, which can lead to a different angle with respect to the origin. This means that the fixed point (origin) in this method is a determining factor in how document similarity is computed. Additionally, since distances are not considered important in the angular measure, it is possible that two documents that have the same angle with respect to the origin are far apart in the document space. Whereas the angular measure would consider them highly similar, the distance measure would conclude the opposite, considering the distance between them.

If, for instance (following Korfhage (1997: p.85)), there are three documents each described by the same two terms, with document vectors

    D_1 = <1, 3>, D_2 = <100, 300>, and D_3 = <3, 1>

then by the cosine measure (which can be found in Korfhage (1997: p.84)) the similarity between D_1 and D_2 is 1.0 and the similarity between D_1 and D_3 is 0.6. Using Euclidean distance, we see that the distance between D_1 and D_2 is about 313.1, whereas the distance between D_1 and D_3 is only about 2.83. It can be argued that in D_1 and D_2 the two terms have the same relative importance, that is, that the ratio of their values is the same. However, D_3 is much closer to D_1 in the document space than D_2 is. Should the ratio of the term values be the significant measure in this case? One could argue that it should not.

3.6.3 Probabilistic matching

The basics of probabilistic matching are as follows. For every document in the collection one can estimate a certain probability of it being relevant or non-relevant with respect to a certain query. If this relevance is combined with the probability of the query-terms occurring in these documents, a discrimination value can be computed. This discrimination value tells us whether a document has a higher probability of being relevant to a query than a randomly chosen document from the collection. If so, this document should be selected as part of the result set to be given to the user as the answer to his or her query. Probabilistic matching has not proven to give significantly better results than vector-based or Boolean techniques. Since the process of probabilistic matching is also very time-consuming, because of all the effort required to compute the probabilities, I will not go deeply into this subject.