Contents 1. INTRODUCTION... 3


Contents

1. INTRODUCTION
2. WHAT IS INFORMATION RETRIEVAL?
   2.1 FIRST: A DEFINITION
   2.2 HISTORY
   2.3 THE RISE OF COMPUTER TECHNOLOGY
   2.4 DATA RETRIEVAL VERSUS INFORMATION RETRIEVAL
3. CONCEPTS AND DEFINITIONS WITHIN THE FIELD OF INFORMATION RETRIEVAL
   3.1 RECALL AND PRECISION
      3.1.1 Recall
      3.1.2 Precision
   3.2 TERM FREQUENCY AND INVERSE DOCUMENT FREQUENCY
      3.2.1 Term frequency (TF)
      3.2.2 Inverse document frequency (IDF)
   3.3 DOCUMENT PRE-PROCESSING
      3.3.1 Lexical analysis
      3.3.2 Elimination of stop-words
      3.3.3 Stemming
      3.3.4 Selection of index-terms
      3.3.5 Construction of term and document categorisation
   3.4 THE INDEXING PROCESS
      3.4.1 Inverted files
      3.4.2 Suffix Trees
      3.4.3 Signature files
   3.5 SEARCHING: STEPS TO BE TAKEN
   3.6 SEARCHING: MATCHING AND RANKING
      3.6.1 Boolean matching
      3.6.2 Vector-based matching
      3.6.3 Probabilistic matching
      3.6.4 Fuzzy matching
   3.7 SIMILARITY- AND WEIGHT-FUNCTIONS THAT CAN BE USED
      3.7.1 Tf-Idf weighting
      3.7.2 Signal-noise ratio
      3.7.3 Term discrimination value
4. IMPLEMENTATION
   4.1 OBJECTIVE
   4.2 ENVIRONMENTAL ISSUES
      4.2.1 Hardware
      4.2.2 Software
   4.3 DOCUMENT COLLECTION PROPERTIES
      4.3.1 Document structure
      4.3.2 Language
      4.3.3 Topic area
      4.3.4 Size of collection
      4.3.5 Dynamics
   4.4 INDEXING AND RELATED ISSUES
      4.4.1 Implementation
      4.4.2 Evaluation
      4.4.3 Possible improvements
   4.5 RETRIEVAL AND RELATED ISSUES
      4.5.1 Implementation
      4.5.2 Evaluation
      4.5.3 Possible improvements
   4.6 TESTING AND RESULTS
      4.6.1 Strategy for testing
      4.6.2 Results when stemming is not used
      4.6.3 Results when stemming is used
      4.6.4 General observations
5. EVALUATION ISSUES
   5.1 PROBLEMS ENCOUNTERED
      5.1.1 Test-data
      5.1.2 Domain expertise
      5.1.3 Phrase searching implementation
6. CONCLUSION
BIBLIOGRAPHY

1. Introduction

Information retrieval can nowadays be seen as one of the most important technologies for finding the textual information we need within a structure that we can search and analyze. Since the explosion of digital texts through the World Wide Web, this information is stored and spread over billions of documents, increasing the demand for smart retrieval systems. This "smartness" of information retrieval systems can be achieved by numerous techniques. These techniques should interpret data in such a way that the returned documents are more than just the result of a simple data comparison between query and document collection. The idea is that a query represents a certain need for information instead of merely data, so the answer to the query has to reflect this need.

In this thesis I will explain several techniques through which retrieval of data can be enhanced towards retrieval of information. My main focus, however, will be on the specific technique of stemming. This technique makes use of the fact that most of the words in a text are derived from a base stem. To make sure that documents related to a query are retrieved even though the actual word searched for (for example, connections) may not explicitly occur within a text, the system stems the query words and the words in the vocabulary, so that query and vocabulary contain only stems of words. In this example, connect would be the word the query is transformed to, and all words derived from connect (connections, connectivity, connecting) would be represented by the stemmed word connect in the vocabulary. Through this method one can achieve a higher number of retrieved documents. Additionally, some sort of semantic clustering takes place, since most words derived from a base stem can be considered semantically correlated to that stem.

Of course this method has a downside, and it concerns precision. If one generalizes specific words into base stems, it is harder to get a precise answer to a query. Hence, when searching for connections, a document that contains the word connectivity twice will score the same as a document that contains the search word connections twice. In this way, you lose precision with respect to what is actually searched for: one would expect the document explicitly containing the word connections to score higher, which it does not in this case. I am interested in what specific advantages and disadvantages this tradeoff brings about.

To test the use of stemming, an implementation of a basic search engine based on inverted files will be discussed. The document collection that will be searched is that of the USA-Web website. This website contains HTML documents with information on the history of the United States of America from the colonial period until the present and is, at the moment of writing, a relatively small collection of about 3500 documents in the English language.

First (chapter 2), I will describe several distinct qualities of an IR system and provide a definition of Information Retrieval. After this description, chapter 3 gives a summary of the basic concepts and ideas within the field of Information Retrieval. This is done to give an informative background of the field, so that the reader will feel comfortable with the aspects discussed in this thesis. Information on different techniques is also given, to explain my choice for an inverted file based IR system as opposed to vector based or probabilistic matching. In chapter 4 the implementation will be discussed, followed by an evaluation of the results obtained through this implementation (chapter 5). Finally, I will give a conclusion (chapter 6) that should give a satisfying answer to the question "What advantages and disadvantages can be seen with respect to the use of a stemming algorithm in an Information Retrieval system applied to a relatively small single-language document collection?".

2. What is Information Retrieval?

2.1 First: a definition

For a good formal definition of Information Retrieval I would like to quote the definition given in Baeza-Yates & Ribeiro-Neto (1990: p.1): "Information Retrieval deals with the representation, storage, organisation of, and access to information items. The organisation and access of information items should provide the user with easy access to the information in which he is interested." In the next paragraphs I will briefly explain why this definition incorporates all of the important features of a good Information Retrieval system.

2.2 History

The history of Information Retrieval goes back as far as the start of paper archiving, in an environment where the maintainers of a collection of information written on paper were no longer able to keep all necessary information about the collection in their memories. After 1940 this situation became more and more critical. A discussion therefore started about how to store and retrieve documents in a document collection without having to go through all the text available in that collection, which had become too time-consuming a task because of the size of the collection. The idea was to store and retrieve documents through a system that represents the complete document collection as well as possible, while a search query on that representative system would be far less costly than scanning through all the text within the collection.

The first systems that tried to do this were (naturally) designed in libraries, where the problem of large document collections first arose. These systems were paper-based themselves (i.e. no computer-based systems were available or even under construction at the time). They relied on extensive categorisation using a system of cards, where each card carried a short description, in key-words or natural language, that described the document it represented as precisely as possible. These document cards were then categorised into more general categories in a recursive process until the top categorisation level was reached. When searching for a document one would first decide in which general top category to begin, and then travel down the tree structure into more specific categories until the cards of the documents in the category best matching the user's query were retrieved. In the ideal case the cards found under that category would exactly represent the subset of documents in the collection the user was interested in. In many cases, however, the classification would also lead to erroneous retrieval, that is, the retrieval of documents the user was not interested in.

2.3 The rise of computer technology

A lot of research has been done since then, starting with improving the paper-based systems and designing all kinds of algorithms and statistical tests for measuring the performance of the systems, while along the way computers emerged as useful helpers. The gradual improvement in the performance of computers shifted the focus over time towards implementing the algorithms and designing computer programs that automate the storing and retrieving of information.

2.4 Data Retrieval versus Information Retrieval

Accompanying the progress made in the computer hardware industry, the focus moved from the retrieval of data, by matching the words in a query with those in the document collection, towards the extraction of information. Here the distinction between data retrieval and information retrieval became important.
In a data retrieval approach one matches the words entered by the user to the occurrences of these words in the documents searched. Only documents that contain these exact words will be returned. The returned result set is unordered, so nothing can be said about the importance of a certain document in comparison to another. Also, these results in no way guarantee any semantic relevance of the returned documents with respect to what the user is searching for, except for the fact that these documents contain the words (data) the user is interested in. The fact that the words themselves can have more than one meaning (ambiguity) is not accounted for. So, on a data level the results will always match the search request, but due to ambiguity and a lack of any semantic knowledge about the data, the information in the result set may not match the user's interest.

An information retrieval system elaborates on the conceptual question a user has, and tries to answer this question by supplying the information most relevant to the user's information need. The techniques that perform this task try to interpret the query as a real question (a user wants to know something about a certain subject) instead of an order to retrieve specific data. After this interpretation the system can expand and/or modify the user's request so that it weighs the data, and can therefore return documents weighed according to their presumed relevance to the user's query. There are different approaches to weighing data and expanding queries, of which I will discuss the most common techniques in the next chapter.

The main advantage of information retrieval over data retrieval is that a more informative result, relevant to the user, can be given instead of only physical matches to the words in the query. The notion of relevance is therefore the key concept when dealing with the retrieval of information.

3. Concepts and definitions within the field of Information Retrieval

Since there are quite a number of specific concepts and definitions within the field of IR, this chapter provides some basic knowledge about the meaning and use of the key concepts.

3.1 Recall and Precision

When rating an IR system one needs guidelines as to what is important in measuring the overall performance of the system. Besides the trade-off between the time, space and effort it takes to build an index (designer) and to query it (designer and user), one has to be aware of what is important to the user, namely which kinds of results are satisfactory. This is where two terms are introduced: recall and precision.

3.1.1 Recall

The notion of recall can be viewed as the ability of the system to retrieve all relevant items matching a query. It can be computed as follows:

$$R = \frac{\mathit{RetRel}}{\mathit{TotRel}}$$

Where:
R: the recall ratio
RetRel: the total number of items retrieved and relevant
TotRel: the total number of relevant items in the collection

Fig. 3.1: Illustration of the terms used. Within the document collection, the set of total relevant items and the set of total retrieved items overlap in the items that are both relevant and retrieved.
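The recall ratio defined above can be computed directly from two sets of document identifiers. The following is a minimal sketch, not part of the thesis implementation; the function name `recall` and the document identifiers are assumptions made for illustration.

```python
def recall(retrieved, relevant):
    """Recall ratio R = RetRel / TotRel for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        raise ValueError("recall is undefined when no item is relevant")
    ret_rel = retrieved & relevant          # items both retrieved and relevant
    return len(ret_rel) / len(relevant)


# Hypothetical judgement: four documents are relevant, three were retrieved,
# and two of the retrieved ones are relevant.
relevant_docs = {"d1", "d2", "d3", "d4"}
retrieved_docs = {"d2", "d3", "d7"}
print(recall(retrieved_docs, relevant_docs))
```

In this example two of the four relevant documents are found, so the ratio is 0.5. Retrieving the whole collection would push the ratio to 1, which is why recall is only meaningful alongside a second measure.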

Thus, if the system retrieves all the documents from the collection that are relevant to the query, the recall ratio is 1. Note that relevancy is a somewhat vague term: for the same query, one user might find a document relevant whereas another user might not. This makes it very hard to determine relevancy without using statistical measures over a group of users.

3.1.2 Precision

Precision is a ratio that compares the number of relevant documents found to the total number of returned documents. It can be computed as follows:

$$P = \frac{\mathit{RetRel}}{\mathit{TotRet}}$$

Where:
P: the precision ratio
RetRel: the total number of items retrieved and relevant
TotRet: the total number of retrieved items

See Fig. 3.1 for a visualization of the terms used. Thus, when the result of a certain query contains only items relevant to this query, the precision ratio is 1. Again, the problem with relevancy is that it is not a fixed value that can be measured; it can be different for anyone using the system.

3.2 Term frequency and Inverse Document Frequency

In the following paragraphs, term frequency (TF) and inverse document frequency (IDF) are frequently mentioned. I will explain here what they are.

3.2.1 Term frequency (TF)

The Tf-factor is the relative frequency of a term in a document. This relative frequency is a measure of how well a certain term describes a document (intra-document characterisation). When a term occurs frequently in a certain document, we assume that it has some significant relation to the subject the document is about. It can be computed as follows. Let $\mathit{Freq}_{i,j}$ be the raw frequency of term $K_i$ in document $D_j$. Then the normalized frequency $F_{i,j}$ of term $K_i$ in the document $D_j$ is given by:

$$F_{i,j} = \frac{\mathit{Freq}_{i,j}}{\sum_{l=1}^{n} \mathit{Freq}_{l,j}}$$

Where n is the number of unique terms (tokens) in document $D_j$.

3.2.2 Inverse document frequency (IDF)

The Idf-factor is the inverse of the frequency of a term measured over all the documents in the collection.
This value is a measure of how good a discriminator a term is (inter-document characterization) for determining the relevance of a document. Even though a term might occur frequently in a certain document, if it occurs frequently in all documents then it is not very useful for distinguishing a relevant document from a non-relevant one. One can imagine that in a document collection about the history of the United States of America, the word america occurs frequently in each document. In this case the term may indicate something about the subject of the document, but since all documents (partly) share this subject, it does not add any value to our discriminating process. It can be computed as follows. Let N be the total number of documents in the system and $N_i$ be the number of documents in which the index term $K_i$ appears:

$$\mathit{Idf}_i = \log(N) - \log(N_i) = \log\frac{N}{N_i}$$

3.3 Document pre-processing

When indexing a document collection, one could assume that all the words in the documents are relevant as index terms. For a lot of words, however, this is not the case. Articles, for instance, are so frequent in natural language that they will hardly distinguish one document from another in the search process. For example, searching for the word "the" will probably retrieve the complete document collection. This is not desirable, and therefore these frequent words, also called stop-words, are better left out of the index, since they do not have any discriminating value.

Also consider punctuation characters: should they be included in the index? For example, if a query is formed as "the president's wife", do you want to search for president's as an index entry, or would a search for president yield the better answer? The first will result in an exact match of what the user asked, but isn't the user actually interested in co-occurrences of president and wife? The latter returns more documents concerning the words wife and president, even though the exact phrase president's wife might not be in the result set. So recall is higher, but at the cost of precision.

So, instead of indexing all the words and characters as they appear in the documents of the collection, it is highly recommended that the text is pre-processed in order to determine which words and characters should be indexed and which should not. Pre-processing has several steps that can be considered.
Not all of the steps are necessary for all systems, since the algorithms performing the tasks might reduce the speed of indexing and/or retrieval. Below I summarize the steps that can be taken, with a brief description of each. For each step I will argue whether it is useful for the IR system I am creating, taking into account the size of the document collection it is intended to serve.

3.3.1 Lexical analysis

Lexical analysis of the texts that are to be indexed involves identifying which characters are to be indexed and what determines the boundaries of the terms. For instance, if an IR system has to handle only word queries, one would want to identify only the words within a document. However, if phrases are to be indexed too, other boundaries might have to be used in order to identify phrases instead of words. Also, the treatment of special characters such as hyphens, digits, punctuation marks and spaces needs to be carefully considered. In my implementation I have chosen not to index punctuation characters, since they do not seem relevant to the subject of the history of the United States (the scope of my document collection), but I have indexed numbers, for historical dates are very likely to be of importance to users.

3.3.2 Elimination of stop-words

Words like "the" and "in" are likely to occur in almost all the documents in the collection (if the texts are in English, of course) and therefore can hardly distinguish between documents with respect to relevance. The best way to avoid retrieving too many documents that do not particularly match the user's query is to filter out these words (also called stop-words). One way to do this is to use the relative frequencies of the words within the documents as a threshold value for determining whether a word has significant meaning for a document's subject or not. The words that are natural candidates for a list of stop-words are articles, prepositions and conjunctions. An additional positive side effect of filtering out stop-words is that it reduces the size of the indexing structure considerably, by up to 40 percent of its initial size. Since the document collection I am working with is relatively small, I have decided not to eliminate stop-words. Also, it may prove interesting to compare a technique such as tf-idf (explained in detail in paragraph 3.7.1), which incorporates the distribution of terms over the document collection and thereby typically lowers the weights of stop-words in a query, with a technique that does not.

3.3.3 Stemming

Because certain words in the texts are derived from base words, some documents that contain information relevant to the user's query might not be found, because the exact word the user entered in the query is not present in the document as a literal string. However, the word entered might be a derivation of a base word that may very well be present in the documents not found initially. For instance, a query that contains the word industrialization will, without stemming, only return documents actually containing this specific word, whereas documents containing words such as industry or industrial will not match, even though they might very well contain interesting information about industrialization. Conversely, if the user enters a base word in his or her query, one can safely assume that derivations of this word in documents will also increase the relevance of these documents for the user. For instance, if a query contains the word industry, it is very likely that documents containing words such as industrialization and industrial are also of interest to the user, whereas in a non-stemmed system these documents will not be returned.
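The mismatch described above can be made concrete with a small sketch. The suffix list and the function `crude_stem` below are illustrative assumptions: a deliberately crude suffix stripper tuned to this one example, and not the Porter algorithm. The three sample documents are hypothetical. The sketch only shows how conflating derived forms lets a query match documents that an exact comparison misses.

```python
# Hypothetical suffix list, longest first. A real stemmer uses far more
# careful rules; this list only covers the example words.
SUFFIXES = ("ialization", "ial", "y")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def matches(query_word, docs, stem=False):
    """Return the ids of documents containing the query word."""
    normalise = crude_stem if stem else str.lower
    target = normalise(query_word)
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if target in {normalise(token) for token in text.split()}
    )


docs = {
    "doc1": "the industry grew quickly",
    "doc2": "industrial towns expanded",
    "doc3": "the war ended",
}

print(matches("industrialization", docs))             # exact matching
print(matches("industrialization", docs, stem=True))  # stemmed matching
```

With exact matching no document is returned, while stemmed matching returns doc1 and doc2, because industry, industrial and industrialization all reduce to the same string. The same conflation is what costs precision: the stemmed index can no longer tell these three words apart.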
To overcome both these inefficiencies, a stemming algorithm can improve results significantly. A typical stemming algorithm removes prefixes and suffixes from the words that are to be indexed in order to arrive at the base word they are derived from. This base word is then indexed, and thus within the index it points to all occurrences of the base word and all of its derivations in the documents. By using this technique one can find documents related to a query instead of only the exact matches, ideally improving recall. The improvement of recall usually comes at the cost of a certain amount of precision. This is mainly because more documents are retrieved, as the query is interpreted more generally by reducing query words to their base words. Using a stemming algorithm also reduces the size of the vocabulary and thus the size of the index, which is an advantage especially on larger document collections. The algorithm most widely used for stemming is the Porter Stemming algorithm (Baeza-Yates & Ribeiro-Neto, 1990: p.433). This is also the one I have used in my implementation in order to perform the research for my thesis question (see chapter 1).

3.3.4 Selection of index-terms

When determining the importance of certain words within a text, one can use different kinds of measures to decide which words are important in describing the semantic content of a document. This determination of index words can be done manually, automatically, or through a combination of both. When the importance of words within texts is determined manually, it is usually someone specialised in the field the documents are about who determines which words should be indexed and which not.
This in itself is already a time-consuming task, but on top of this the task needs to be performed again every time the scope of the document collection is expanded with new texts containing new concepts and therefore new possible index terms. Thus, there is a need for automation.

Automatically determining the value of words as index terms can be done using the syntax and the semantics of the text. A technique that has proven to give good results is the identification of noun groups (Baeza-Yates & Ribeiro-Neto, 1990: p.169). This technique involves stripping the text of words that are not nouns. The idea behind this is that nouns convey most of the semantics of a text, as opposed to word types such as verbs, adjectives, articles, etc. Instead of simply using these nouns as index terms, another process takes place beforehand. While stripping the non-noun words, we count the number of non-noun words that occur between two nouns. By defining a threshold (for instance, at most 3 words may occur between two nouns) we can combine nouns that occur close together into one noun group, which in the end will be indexed as a relevant set of index terms that represents a conceptual logical view of the documents.

Since this technique is based on a syntactic and semantic approach, whereas I am focussing more on the probabilistic ways of indexing, I have chosen not to select index terms with the techniques described above, neither manually nor automatically. Instead, I will use frequency information as a measure of term importance.

3.3.5 Construction of term and document categorisation

Term categorisation is constructed in order to be able to show all documents related to a query instead of only exact matches. By creating categories into which words can be sorted, the search mechanism can expand the query the user feeds into the system into a broader query in order to find related documents. Two ways of categorisation are widely used at the moment: one is the construction of a thesaurus and the other is the use of document clustering.

Thesaurus

A thesaurus is a list of words that are considered to be the most important or significant within a certain domain of knowledge, where for each word in this list a set of related words is mapped onto it. A thesaurus can be created manually. In this case it has to be done by someone who is an expert in the field of knowledge. This person will decide which words should be used as index words and to which category these words belong. When a search is done, the system will check in which category the query terms occur, and is then capable of returning documents that contain terms within the same category as the query terms.
The manual process of determining which words should be indexed, and to which category they belong, is however a very time-consuming task, especially when the document collection is dynamic and the content has to be updated frequently.

Document clustering

Document clustering is a method that focuses on identifying similarities between the documents within a document collection, based mostly on the importance of certain terms for certain conceptual classes of documents. A distinction is made between local clustering methods and global clustering methods.

Local clustering means that clustering is performed on the outcome of a query. The outcome is the subset of the document collection that matches a certain user query. This method is used a lot in relevance feedback cycles, where one finds the most relevant terms by focussing on the documents marked by the user as relevant to the query. For instance, if a query such as "personal history of president Carter" is entered, documents will be returned which contain one, some or all of the words in the query. However, there may be many documents that say a lot about president Carter's political life but have nothing to do with his personal life. The user can now browse this result set and mark the documents that are about this president's personal life. The system is then able to adjust the user query and add the terms that best describe the content of the documents the user marked as relevant. This new query is re-submitted to the system, resulting in a new set of matching documents. On this new set of documents, the user can again indicate which documents are most relevant. This cycle can go on until the user is satisfied with the documents returned. With every cycle, the relevance of the documents should improve. Because of query expansion, the recall does not have to suffer from the search for a higher relevance.
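One cycle of the feedback loop described above can be sketched as follows. The expansion rule used here (add the k most frequent terms of the documents marked relevant that are not yet in the query) is an assumption chosen for brevity; the text does not prescribe a particular rule. The function name `expand_query` and the sample documents are hypothetical.

```python
from collections import Counter


def expand_query(query_terms, relevant_docs, k=2):
    """Return the query extended with the k most frequent new terms
    found in the documents the user marked as relevant."""
    known = set(query_terms)
    counts = Counter()
    for text in relevant_docs:
        counts.update(t for t in text.lower().split() if t not in known)
    # Highest count first; ties are broken alphabetically so the result
    # is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [term for term, _ in ranked[:k]]


# Hypothetical documents the user marked as being about Carter's personal life.
marked = [
    "carter family farm plains",
    "carter family childhood plains",
]
print(expand_query(["carter", "personal"], marked))
```

Here the terms family and plains are added, so the re-submitted query leans towards the documents the user preferred. In a system that keeps stop-words in the index, a raw frequency count like this would mostly add stop-words, so some weighting of the candidate terms would be needed in practice.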
Global clustering, instead of focusing on a subset of the document set, uses the complete document collection for clustering purposes. This technique is in most cases realised by creating a vector for each document within the collection, after which the vectors are compared to one another and a degree of similarity is computed. The degree of similarity is then used to cluster documents into classes that should resemble certain subjects or fields of information. These global clusters can be used to present documents related to those retrieved by a user's initial query. Ideally, this provides the user with more information about the subject searched for, including information that was not found initially by the query.

Both techniques have proven to improve precision significantly, especially when they are combined. However, classification of documents, globally and locally, is a time-consuming task. For global classification the vectors need to be recomputed every time a new document is added to the collection. Local classification perhaps provides a more precise answer to a query in the end, but computing and comparing the vectors during search time slows down the search process. Since both of the above-mentioned techniques imply a substantial increase in search time and/or indexing time, I have chosen not to use them in my implementation.

3.4 The indexing process

The process of indexing is a fundamental part of the creation of an IR system. Indexing means that instead of seeking matches in a document collection by searching linearly through the original documents, we create a system that allows searching through a representation of this document collection. This representation is called our index. The query submitted to the system should be in the same format as the index in order to be able to determine the matches between the query and the document set. This does not mean that the user has to formulate the query in this format. It is common that after a user submits the query, the system converts it internally into the desired format. After this conversion the reformatted query is compared to the index, and a result set of matching documents is retrieved and returned to the user in a user-friendly format.

An index can be created using various techniques, of which I will only explain the ones most commonly used at the moment and which have proven to deliver the best results. These are inverted files, suffix trees and signature files. A short description of these techniques is given below, following paragraphs 8.2 and 8.3 of Baeza-Yates & Ribeiro-Neto (1990: p.203). While doing this, I will also briefly explain my choice for inverted files as the technique to be used for my IR system.

3.4.1 Inverted files

Perhaps the most commonly used method for indexing is the use of inverted files.
Inverted files combine a vocabulary over all documents in our collection with pointers to the positions within the documents at which the words of the vocabulary occur (see Fig. 3.2). In this way one can search for a queried word in the vocabulary and, when it is found, use the list of pointers associated with this word to return the result set of documents in which the word occurs. With Boolean operators one can combine the result sets for the individual words in the query into a final result set (as yet unsorted). By using ranking algorithms one can measure how well certain documents match a query. Given this information, the documents in the result set can be ranked according to their relevance, so that the document best matching the user's query is shown first in the list. The different ranking algorithms and techniques that exist will be discussed in paragraphs 3.6 and 3.7.

Vocabulary    Index
a             Doc1; Doc2
an            Doc2; Doc3
border        Doc2; Doc4; Doc6
broken        Doc3; Doc5
common        etc.

Fig. 3.2
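The structure in Fig. 3.2 can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not the implementation discussed in chapter 4: the pointers identify whole documents rather than positions within them, tokens are split on whitespace only, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map every word of the vocabulary to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_and(index, query):
    """Boolean AND: documents that contain every word of the query."""
    postings = [index.get(word, set()) for word in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)


docs = {
    "Doc1": "a border",
    "Doc2": "a broken border",
    "Doc3": "an old broken fence",
}
index = build_inverted_index(docs)
print(sorted(index["border"]))
print(sorted(search_and(index, "broken border")))
```

The lookup for border yields Doc1 and Doc2, and the AND query leaves only Doc2. The result is an unordered set, which matches the description above: ranking is a separate step applied afterwards.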

Inverted indices have the advantage of relatively easy implementation and high performance. These two advantages, however, only hold for word searching in texts. For phrase and suffix searching this method has proven less efficient than other methods.

3.4.2 Suffix Trees

Using suffix trees is another way of indexing textual documents. A suffix tree was first presented in Manber & Myers (1990). A good explanation of the use of this technique is given in Baeza-Yates & Ribeiro-Neto (1990: p.199): "This structure can be used to index only words (without stopwords) as the inverted index as well as to index any text character. This makes it suitable for a wider spectrum of applications, such as genetic databases. However, for word-based applications, inverted files perform better unless complex queries are an important issue. This index sees the text as one long string. Each position in the text is considered as a text suffix (i.e. a string that goes from that text position to the end of the text). It is not difficult to see that two suffixes starting at different positions are lexicographically different (assume that a character smaller than all the rest is placed at the end of the text). Each suffix is thus uniquely identified by its position. Not all text positions need to be indexed. Index points are selected from the text, which point to the beginning of the text positions that will be retrievable. For instance, it is possible to index only word beginnings to have a functionality similar to inverted indices. Those elements which are not index points are not retrievable (as in an inverted index it is not possible to retrieve the middle of a word)." See Fig. 3.3 (taken from Baeza-Yates & Ribeiro-Neto (1990: p.200)) for an illustration.

Text:
This is a text. A text has many words. Words are made from letters.

Suffixes:
text. A text has many words. Words are made from letters.
text has many words. Words are made from letters.
many words. Words are made from letters.
words. Words are made from letters.
Words are made from letters.
made from letters.
letters.

Fig. 3.3

If we were to index only word beginnings, we would have functionality similar to inverted files.

This technique has the advantage that it can index complete words (as inverted files do) but is also capable of indexing any single character in the text. This makes it suitable for applications such as genetic databases, and it also performs better at search time when answering complex queries (such as phrase searches), an advantage that inverted files lack.

The suffix tree can be built using a trie data structure or an array data structure (see Fig. 3.4). The disadvantage of the trie structure is its space overhead of between 120% and 240% over the text size, which makes it unsuitable for large text collections. To improve space utilization, this data structure can be compacted into a Patricia tree (more information on Patricia trees can be found in Gonnet (1987) under the name of PAT-arrays). Compacting the suffix trie into a Patricia tree (see the second illustration in Fig. 3.4) involves compressing all unary paths, i.e. paths where each node has just one child. The result of compressing these paths can be seen in Fig. 3.4. For example, since there is only one word starting with a t (namely text), only the part of the suffix that uniquely identifies it (in this case the last character) is indexed after the t. This saves a considerable amount of space.

The suffix array technique stores all pointers to the text suffixes in an array, in lexicographical order of the suffixes, and has about the same space requirements as inverted indices (i.e. close to 40% overhead over the text size). Although the suffix array achieves about the same search times as inverted indices, and has the advantage of not degrading in search performance on phrases (complex queries), building this structure takes about 5 to 10 times longer than building inverted files. Because suffix arrays occupy far less space than the trie-structured approach, they are the most widely used technique for suffix searching.

Since complex queries and phrase searching are not my main focus, and the cost of building the index is considerably higher than for inverted indexes (more space for suffix tries, more time for suffix arrays), suffix indexing and searching may be interesting as a whole but do not outperform inverted files in my comparison of techniques for the indexing task to be performed.

[Fig. 3.4: the suffix trie and the compacted suffix tree (Patricia tree) built over the text "This is a text. A text has many words. Words are made from letters."; the leaves hold the text positions of the indexed suffixes.]

3.4.3 Signature files

Signature files are a way of indexing based on hashing techniques. By means of a hashing function one computes a fixed-length bit signature for each term in the vocabulary. The documents are then split into blocks, and for each block a signature is computed by disjunctively combining (OR-ing) the signatures of all terms occurring in this block. If a word occurs in a block of text, all the bits that are set in the term-signature must therefore also be set in the block-signature (see Fig. 3.5, taken from Baeza-Yates & Ribeiro-Neto (1990: p.206)).

With this technique a query is also represented as a signature. If we enter a query with the word many, the query-signature is H(many) as shown in Fig. 3.5. This query-signature is then matched against the document/block-signatures by bitwise AND-ing their values. The idea behind this is that a block matches a query if the outcome of the AND-operation on the block-signature and the query-signature has all the bits set that are also set in the query-signature. For example, the search for many results in the AND-operations depicted in Fig. 3.6.

[Fig. 3.5: the text "This is a text. A text has many words. Words are made from letters." split into Block 1 to Block 4, with the text signature of each block and the signature function H for the terms text, many, words, made and letters.]

[Fig. 3.6: for the query many, the query-signature is AND-ed with the signature of each of Block1 to Block4.]

Comparing the resulting bit-signatures with the original query-signature, the only matching block turns out to be Block2 (whose result also has the first two bits set). This is correct, since Block2 is indeed the only block that contains the word many.

The space overhead for this indexing technique is between 10 and 20 percent over the text size. This is the most positive feature of signature files, especially for larger document collections. Against this stands the fact that searching has to be done sequentially, which is an important disadvantage, especially when searching in larger document collections.
Thus, if space is more important than search speed, this method is a good choice. Otherwise (which is mostly the case, since a short search time is usually rated as one of the more important features of an IR system) inverted files outperform this method for most applications. Since the system I will discuss in the following chapters focuses more on speed of retrieval, precision and recall than on storage space, this technique also compares unfavourably with inverted files.
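The signature technique described above can be sketched in a few lines. Everything specific in this sketch is an assumption made for illustration: the signature width (SIGNATURE_BITS), the number of bits set per term (BITS_PER_TERM), the use of MD5 as the hashing function H, the function names, and the way the example text is split into four blocks. Because different terms can set the same bits, a block may match by coincidence (a false drop), so the sketch verifies every candidate block against its words.

```python
import hashlib

SIGNATURE_BITS = 16  # assumed signature width
BITS_PER_TERM = 2    # assumed number of bits set per term


def term_signature(term):
    """Hash a term to a bit mask with at most BITS_PER_TERM bits set."""
    digest = hashlib.md5(term.lower().encode("utf-8")).digest()
    signature = 0
    for byte in digest[:BITS_PER_TERM]:
        signature |= 1 << (byte % SIGNATURE_BITS)
    return signature


def block_signature(words):
    """OR together the signatures of all terms occurring in the block."""
    signature = 0
    for word in words:
        signature |= term_signature(word)
    return signature


def candidate_blocks(blocks, query_word):
    """Sequentially scan all blocks; keep those whose signature covers the query."""
    query = term_signature(query_word)
    return [i for i, block in enumerate(blocks)
            if (block_signature(block) & query) == query]


def search(blocks, query_word):
    """Remove false drops by checking the candidate blocks against their words."""
    return [i for i in candidate_blocks(blocks, query_word)
            if query_word.lower() in blocks[i]]


# Assumed block boundaries for the example text (blocks are numbered from 0).
blocks = [
    ["this", "is", "a", "text"],
    ["a", "text", "has", "many", "words"],
    ["words", "are", "made", "from"],
    ["letters"],
]
print(candidate_blocks(blocks, "many"))  # always contains 1, may contain false drops
print(search(blocks, "many"))            # only the block that really contains "many"
```

In a real system the block signatures would be computed once at indexing time and stored; the sequential scan over all of them is what makes searching slow for large collections.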

3.5 Searching: steps to be taken

The searching process depends on the type of index created. Since I have chosen to use inverted files, I will limit my discussion to the search mechanisms that operate on inverted files. A classical inverted file system contains two files, the vocabulary file and the inverted file itself. The vocabulary file is sorted alphabetically, so that a binary search can be used to look up the words extracted from the query. Each word in the vocabulary file is typically mapped to the position of its entry in the inverted file. The reason for this design is that the vocabulary, even for large text collections, can often be stored in main memory and therefore provides a fast initial search step. The following steps are used (see Fig. 3.7, extended from Fig. 3.2) to get from the query to the result set of matching documents:

Vocabulary search: The words in the query are isolated and searched for in the vocabulary.
Retrieval of occurrences: The mappings found in the vocabulary are used to extract from the inverted file all documents that match the individual words, together with any discriminator values computed for each word within the document or document collection.
Manipulation of occurrences: The individual occurrences of the words in the query are combined to resolve phrases, proximity or Boolean operations (only Boolean operations are shown in the figure). After this process one set remains, which contains the documents that match the query.

Query: a AND border

Vocabulary   Index
a            Doc1; Doc2
an           Doc2; Doc3
border       Doc2; Doc4; Doc6
broken       Doc3; Doc5
common       (etc.)

Manipulation of occurrences: <Doc1;Doc2> AND <Doc2;Doc4;Doc6>
Result: <Doc2>

Here, Document 2 is the only document in which both terms occur; it is therefore returned to the user as matching the query.

Fig. 3.7

Once these steps are carried out we have a set containing the documents that completely or partially match the query, but how can we express the distinctions between these documents in terms of relevance? This question has led to many observations and implementations of possible ranking algorithms. To understand these ranking algorithms and make a good choice between them, I will describe the algorithms most common in the field and elaborate on their advantages and disadvantages with respect to the task to be performed.

3.6 Searching: matching and ranking

When querying an information retrieval system, one can choose from different strategies for performing the search on the document collection. Below I will describe some of the retrieval techniques currently known.

3.6.1 Boolean matching

When retrieval systems were first devised for larger document collections, retrieval was done through punched cards and edge-notched cards that only held information in the form of bit representations. With this representation one could, for instance, make a signature card for each document and match a query against the collection of signatures through Boolean connectives such as AND, OR and NOT. This system of retrieval is called the Boolean method of retrieval and is based on set theory. Many systems still rely on this Boolean system (now using electronics instead of cards), but improvements, adjustments and additions have been made because a Boolean system alone does not provide good ranking. The Boolean technique draws a hard line between documents that are relevant and documents that are non-relevant to a query (either the document contains one or several of the words mentioned in the query, or it doesn't).
This makes it prone to missing documents that are conceptually equivalent to the query but do not contain the specific words mentioned in the query. Also, in the basic form of Boolean retrieval no ranking is possible between matching documents that contain the same number of query words: if two documents both contain two words from the query, both qualify for retrieval, but which one should be shown to the user first? For these and other reasons, the Boolean system has been combined with other techniques in order to get the best of both worlds. This combination with other techniques is called the extended Boolean model. Although the Boolean technique in its basic form has been criticised since its inception, it is still widely used because of its ease of implementation and its simple logic. Now that combinations with probabilistic and weighting techniques have been tested, the extended Boolean technique is still one of the major techniques in use and can compete against (or better: be combined with) other techniques such as proximity, fuzzy, probabilistic and vector-based matching.

I will implement an OR-based Boolean system as a basis for ranking my results. This means that all matches for the separate terms occurring in the query are OR-ed together to form a final set of (partly) matching documents. In addition, I will use extended Boolean techniques that incorporate term-frequency (see paragraph 3.2.1) and inverse document frequency (see paragraph 3.2.2) values to decide the relevance of terms to documents, and thereby the relevance of documents to the query containing these terms. These techniques also give a higher ranking to documents that contain all query-terms than to documents that contain only a few or just one of the query-terms.
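As a rough sketch of OR-based matching combined with term-frequency and inverse document frequency weights, consider the example below. It is not the ranking function of the system described in the following chapters: the function name rank_or_query, the toy documents, the whitespace tokenisation and in particular the idf variant (natural logarithm of N/df, plus 1) are assumptions chosen for illustration. With such a summed score, documents containing more of the query terms tend to rank higher, although a very frequent single term can still outweigh this.

```python
import math
from collections import Counter


def rank_or_query(docs, query_terms):
    """Score every document that contains ANY query term by summed tf * idf."""
    n_docs = len(docs)
    term_counts = {doc_id: Counter(text.lower().split())
                   for doc_id, text in docs.items()}
    scores = {}
    for term in {t.lower() for t in query_terms}:
        df = sum(1 for counts in term_counts.values() if term in counts)
        if df == 0:
            continue  # term does not occur in the collection
        # Assumed idf variant; the +1 keeps terms that occur in every
        # document from contributing a score of zero.
        idf = math.log(n_docs / df) + 1.0
        for doc_id, counts in term_counts.items():
            if term in counts:
                scores[doc_id] = scores.get(doc_id, 0.0) + counts[term] * idf
    # Highest score first; ties are broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy collection, invented for this illustration.
docs = {
    "d1": "border control",
    "d2": "broken border fence",
    "d3": "broken window",
    "d4": "open road",
}
for doc_id, score in rank_or_query(docs, ["broken", "border"]):
    print(doc_id, round(score, 3))
```

Here d2 contains both query terms and is listed first, d1 and d3 each contain one term, and d4 contains none and is not part of the OR-ed result set at all.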

3.6.2 Vector-based matching

As the Boolean technique alone showed some deficiencies in the early days of Information Retrieval, alternative methods of retrieval were developed. One of the first successful alternative systems was the SMART system, described in Lesk (1964). It was the first system to show successful results with vector-based matching. There are various distinct techniques to implement vector-based matching, and these techniques can internally make use of different similarity functions that express how similar a document is to a query. Two concepts can be distinguished: vector-based matching through metrics, and vector-based matching through an angular measure. Both concepts are described below.

Metrics- or distance-based

If one performs vector-based matching through metrics, similarity is measured by computing the distance between the documents and the query in the document space (see Fig. 3.8). If two documents have a distance of 0, they are identical (the distance of a document to itself is always 0). Hence, the smaller the distance between a document and a query in the document space, the more likely it is that this document matches the query well.

[Fig. 3.8: Document space in a metrics-based system, showing the documents D1..D4 and the query Q1.]
Here, the arrows represent the absolute distance (on whatever scale) between the Documents (D1..D4) and the Query (Q1). There is no fixed origin, and therefore the arrows are not to be seen as angular vectors, but as pure metrical distance-indicators.
Fig. 3.8

Angle-based

Another way of looking at vector-based matching is to consider the angles between the document vectors and the query vector in the document space with respect to a certain origin (as illustrated in Fig. 3.9).
This method is called an angular measure, and one widely used example of this kind of measure is the cosine measure (as described in Wilkinson & Hingston (1991)). This measure calculates the cosine of the angle between the vectors (with respect to the origin) representing the document and the query (or two documents, for document-to-document similarity).

[Fig. 3.9: Document space in an angle-based system, with axes x, y and z.]
(x,y,z)-coordinates:
D1: (1, 2, 4)
D2: (3, 7, 4)
D3: (7, 4, 1)
D4: (7, 1, 4)
Q1: (3, 5, 3)
Here, the arrows represent the vectors for the Documents (D1..D4) and the Query (Q1). They originate from the Origin. The distances between the physical locations of the Documents and the Query are not important, only the angles of their vectors at the Origin.
Fig. 3.9

These measures (angle- and metric-based) analyse the document space differently and use different techniques to define similarity between documents and the query. The distance measure looks at a group of documents and uses only the distances between the documents in the document space; it therefore does without any fixed point onto which the values may be mapped. This makes the method intrinsic: it only focuses on the internal relations between the objects. That is, from a given point in the document space, all directions are considered equal. The angular measure, however, is extrinsic, because angles can only be computed from a fixed point. If you change the indexing process (for instance by changing the term-weighting scheme), the origin from which the angles are measured can change. The documents can thereby be assigned a different point in the document space, which can lead to a different angle at the origin. This means that the fixed point (origin) in this method is a determining factor in how document similarity is computed. Additionally, since distances are not considered important in the angular measure, it is possible that two documents that have the same angle at the origin are far apart in the document space. Whereas the angular measure would consider them highly similar, the distance measure would conclude the opposite, given the distance between them.
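The difference between the two measures can be tried out on the coordinates of Fig. 3.9. The sketch below is an illustration only: the function names are my own, and the vector scaled_d1 (D1 multiplied by 100) is a hypothetical extra document added to show a case where the two measures disagree.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v, measured at the origin."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between the points u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Coordinates taken from Fig. 3.9.
documents = {"D1": (1, 2, 4), "D2": (3, 7, 4), "D3": (7, 4, 1), "D4": (7, 1, 4)}
query = (3, 5, 3)

# Angular measure: larger cosine means more similar.
by_angle = sorted(documents,
                  key=lambda d: -cosine_similarity(documents[d], query))
# Distance measure: smaller distance means more similar.
by_distance = sorted(documents,
                     key=lambda d: euclidean_distance(documents[d], query))
print(by_angle)
print(by_distance)

# Hypothetical document pointing in the same direction as D1, but 100 times longer.
scaled_d1 = tuple(100 * x for x in documents["D1"])
print(cosine_similarity(scaled_d1, documents["D1"]))   # same angle
print(euclidean_distance(scaled_d1, documents["D1"]))  # yet far apart
```

For the five vectors of Fig. 3.9 both measures happen to produce the same ordering of the documents; the scaled vector shows that this agreement is not guaranteed.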
If, for instance (according to Korfhage (1997: p.85)), there are three documents each described by the same two terms, with document vectors

D1 = <1,3>, D2 = <100,300> and D3 = <3,1>,

then by the cosine measure (which can be found in Korfhage (1997: p.84)) the similarity between D1 and D2 is 1.0 and the similarity between D1 and D3 is 0.6. Using Euclidean distance, the distance between D1 and D2 is approximately 313.1 and the distance between D1 and D3 is approximately 2.83. It can be argued that in D1 and D2 the two terms have the same relative importance; that is, that the ratio of their values is the same. However, D3 is much closer to D1 than D2 is in the document space. Should the ratio of the distance be the significant measure in this case? One could argue that it should not.

3.6.3 Probabilistic matching

The basics of probabilistic matching are as follows. Every document in the collection can be estimated to have a certain probability of being relevant or non-relevant with respect to a certain query. If this relevance is combined with the probability of query-terms occurring in these documents, a discrimination value can be computed. This discrimination value tells us whether a document has a higher probability of being relevant to a query than a randomly chosen document from the collection. If this is the case, then this document should be selected as part of the result set given to the user as the answer to his or her query. Probabilistic matching has not been proven to give significantly better results than vector-based or Boolean techniques. Since probabilistic matching is also very time-consuming, because of the effort needed to compute all the probabilities, I will not go deeply into this subject.


More information

(Refer Slide Time: 4:00)

(Refer Slide Time: 4:00) Principles of Programming Languages Dr. S. Arun Kumar Department of Computer Science & Engineering Indian Institute of Technology, Delhi Lecture - 38 Meanings Let us look at abstracts namely functional

More information

Linguistics and Philosophy 23: , Is Compositionality Formally Vacuous? Francis Jeffry Pelletier

Linguistics and Philosophy 23: , Is Compositionality Formally Vacuous? Francis Jeffry Pelletier Linguistics and Philosophy 23: 629-633, 1998 Is Compositionality Formally Vacuous? Ali Kazmi Dept. Philosophy Univ. Calgary Francis Jeffry Pelletier Dept. Philosophy Univ. Alberta We prove a theorem stating

More information

Component ranking and Automatic Query Refinement for XML Retrieval

Component ranking and Automatic Query Refinement for XML Retrieval Component ranking and Automatic uery Refinement for XML Retrieval Yosi Mass, Matan Mandelbrod IBM Research Lab Haifa 31905, Israel {yosimass, matan}@il.ibm.com Abstract ueries over XML documents challenge

More information

A Model for Information Retrieval Agent System Based on Keywords Distribution

A Model for Information Retrieval Agent System Based on Keywords Distribution A Model for Information Retrieval Agent System Based on Keywords Distribution Jae-Woo LEE Dept of Computer Science, Kyungbok College, 3, Sinpyeong-ri, Pocheon-si, 487-77, Gyeonggi-do, Korea It2c@koreaackr

More information

Indexing Web pages. Web Search: Indexing Web Pages. Indexing the link structure. checkpoint URL s. Connectivity Server: Node table

Indexing Web pages. Web Search: Indexing Web Pages. Indexing the link structure. checkpoint URL s. Connectivity Server: Node table Indexing Web pages Web Search: Indexing Web Pages CPS 296.1 Topics in Database Systems Indexing the link structure AltaVista Connectivity Server case study Bharat et al., The Fast Access to Linkage Information

More information

Midterm Exam Search Engines ( / ) October 20, 2015

Midterm Exam Search Engines ( / ) October 20, 2015 Student Name: Andrew ID: Seat Number: Midterm Exam Search Engines (11-442 / 11-642) October 20, 2015 Answer all of the following questions. Each answer should be thorough, complete, and relevant. Points

More information

Web Search Engine Question Answering

Web Search Engine Question Answering Web Search Engine Question Answering Reena Pindoria Supervisor Dr Steve Renals Com3021 07/05/2003 This report is submitted in partial fulfilment of the requirement for the degree of Bachelor of Science

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 9 Indexing and Searching with Gonzalo Navarro Introduction Inverted Indexes Signature Files Suffix Trees and Suffix Arrays Sequential Searching Multi-dimensional Indexing

More information

Semantic Search in s

Semantic Search in  s Semantic Search in Emails Navneet Kapur, Mustafa Safdari, Rahul Sharma December 10, 2010 Abstract Web search technology is abound with techniques to tap into the semantics of information. For email search,

More information

Recap: lecture 2 CS276A Information Retrieval

Recap: lecture 2 CS276A Information Retrieval Recap: lecture 2 CS276A Information Retrieval Stemming, tokenization etc. Faster postings merges Phrase queries Lecture 3 This lecture Index compression Space estimation Corpus size for estimates Consider

More information

Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts

Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts Kwangcheol Shin 1, Sang-Yong Han 1, and Alexander Gelbukh 1,2 1 Computer Science and Engineering Department, Chung-Ang University,

More information

This book is licensed under a Creative Commons Attribution 3.0 License

This book is licensed under a Creative Commons Attribution 3.0 License 6. Syntax Learning objectives: syntax and semantics syntax diagrams and EBNF describe context-free grammars terminal and nonterminal symbols productions definition of EBNF by itself parse tree grammars

More information

INFSCI 2140 Information Storage and Retrieval Lecture 6: Taking User into Account. Ad-hoc IR in text-oriented DS

INFSCI 2140 Information Storage and Retrieval Lecture 6: Taking User into Account. Ad-hoc IR in text-oriented DS INFSCI 2140 Information Storage and Retrieval Lecture 6: Taking User into Account Peter Brusilovsky http://www2.sis.pitt.edu/~peterb/2140-051/ Ad-hoc IR in text-oriented DS The context (L1) Querying and

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval Mohsen Kamyar چهارمین کارگاه ساالنه آزمایشگاه فناوری و وب بهمن ماه 1391 Outline Outline in classic categorization Information vs. Data Retrieval IR Models Evaluation

More information

Modern information retrieval

Modern information retrieval Modern information retrieval Modelling Saif Rababah 1 Introduction IR systems usually adopt index terms to process queries Index term: a keyword or group of selected words any word (more general) Stemming

More information

Information Retrieval and Data Mining Part 1 Information Retrieval

Information Retrieval and Data Mining Part 1 Information Retrieval Information Retrieval and Data Mining Part 1 Information Retrieval 2005/6, Karl Aberer, EPFL-IC, Laboratoire de systèmes d'informations répartis Information Retrieval - 1 1 Today's Question 1. Information

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 3 Modeling Part I: Classic Models Introduction to IR Models Basic Concepts The Boolean Model Term Weighting The Vector Model Probabilistic Model Chap 03: Modeling,

More information

Boolean Model. Hongning Wang

Boolean Model. Hongning Wang Boolean Model Hongning Wang CS@UVa Abstraction of search engine architecture Indexed corpus Crawler Ranking procedure Doc Analyzer Doc Representation Query Rep Feedback (Query) Evaluation User Indexer

More information

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT Prof. Dipartimento di Elettronica e Informazione Politecnico di Milano INFORMATION SEARCH AND RETRIEVAL Inf. retrieval 1 PRESENTATION SCHEMA GOALS AND

More information

Information Retrieval

Information Retrieval Information Retrieval Suan Lee - Information Retrieval - 05 Index Compression 1 05 Index Compression - Information Retrieval - 05 Index Compression 2 Last lecture index construction Sort-based indexing

More information

IJRIM Volume 2, Issue 2 (February 2012) (ISSN )

IJRIM Volume 2, Issue 2 (February 2012) (ISSN ) AN ENHANCED APPROACH TO OPTIMIZE WEB SEARCH BASED ON PROVENANCE USING FUZZY EQUIVALENCE RELATION BY LEMMATIZATION Divya* Tanvi Gupta* ABSTRACT In this paper, the focus is on one of the pre-processing technique

More information

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues COS 597A: Principles of Database and Information Systems Information Retrieval Traditional database system Large integrated collection of data Uniform access/modifcation mechanisms Model of data organization

More information

SOFTWARE ENGINEERING Prof.N.L.Sarda Computer Science & Engineering IIT Bombay. Lecture #10 Process Modelling DFD, Function Decomp (Part 2)

SOFTWARE ENGINEERING Prof.N.L.Sarda Computer Science & Engineering IIT Bombay. Lecture #10 Process Modelling DFD, Function Decomp (Part 2) SOFTWARE ENGINEERING Prof.N.L.Sarda Computer Science & Engineering IIT Bombay Lecture #10 Process Modelling DFD, Function Decomp (Part 2) Let us continue with the data modeling topic. So far we have seen

More information

Optimal Clustering and Statistical Identification of Defective ICs using I DDQ Testing

Optimal Clustering and Statistical Identification of Defective ICs using I DDQ Testing Optimal Clustering and Statistical Identification of Defective ICs using I DDQ Testing A. Rao +, A.P. Jayasumana * and Y.K. Malaiya* *Colorado State University, Fort Collins, CO 8523 + PalmChip Corporation,

More information

Codify: Code Search Engine

Codify: Code Search Engine Codify: Code Search Engine Dimitriy Zavelevich (zavelev2) Kirill Varhavskiy (varshav2) Abstract: Codify is a vertical search engine focusing on searching code and coding problems due to it s ability to

More information

Organizing Information. Organizing information is at the heart of information science and is important in many other

Organizing Information. Organizing information is at the heart of information science and is important in many other Dagobert Soergel College of Library and Information Services University of Maryland College Park, MD 20742 Organizing Information Organizing information is at the heart of information science and is important

More information

Introduction to Information Retrieval (Manning, Raghavan, Schutze)

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 3 Dictionaries and Tolerant retrieval Chapter 4 Index construction Chapter 5 Index compression Content Dictionary data structures

More information

COMP 410 Lecture 1. Kyle Dewey

COMP 410 Lecture 1. Kyle Dewey COMP 410 Lecture 1 Kyle Dewey About Me I research automated testing techniques and their intersection with CS education My dissertation used logic programming extensively This is my second semester at

More information

Information Retrieval

Information Retrieval s Information Retrieval Information system management system Model Processing of queries/updates Queries Answer Access to stored data Patrick Lambrix Department of Computer and Information Science Linköpings

More information

CHAPTER-26 Mining Text Databases

CHAPTER-26 Mining Text Databases CHAPTER-26 Mining Text Databases 26.1 Introduction 26.2 Text Data Analysis and Information Retrieval 26.3 Basle Measures for Text Retrieval 26.4 Keyword-Based and Similarity-Based Retrieval 26.5 Other

More information

Relevance of a Document to a Query

Relevance of a Document to a Query Relevance of a Document to a Query Computing the relevance of a document to a query has four parts: 1. Computing the significance of a word within document D. 2. Computing the significance of word to document

More information

CS54701: Information Retrieval

CS54701: Information Retrieval CS54701: Information Retrieval Basic Concepts 19 January 2016 Prof. Chris Clifton 1 Text Representation: Process of Indexing Remove Stopword, Stemming, Phrase Extraction etc Document Parser Extract useful

More information

XDS An Extensible Structure for Trustworthy Document Content Verification Simon Wiseman CTO Deep- Secure 3 rd June 2013

XDS An Extensible Structure for Trustworthy Document Content Verification Simon Wiseman CTO Deep- Secure 3 rd June 2013 Assured and security Deep-Secure XDS An Extensible Structure for Trustworthy Document Content Verification Simon Wiseman CTO Deep- Secure 3 rd June 2013 This technical note describes the extensible Data

More information

Encoding Words into String Vectors for Word Categorization

Encoding Words into String Vectors for Word Categorization Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

More information

Authoritative K-Means for Clustering of Web Search Results

Authoritative K-Means for Clustering of Web Search Results Authoritative K-Means for Clustering of Web Search Results Gaojie He Master in Information Systems Submission date: June 2010 Supervisor: Kjetil Nørvåg, IDI Co-supervisor: Robert Neumayer, IDI Norwegian

More information

Information Retrieval: Retrieval Models

Information Retrieval: Retrieval Models CS473: Web Information Retrieval & Management CS-473 Web Information Retrieval & Management Information Retrieval: Retrieval Models Luo Si Department of Computer Science Purdue University Retrieval Models

More information

Patent Terminlogy Analysis: Passage Retrieval Experiments for the Intellecutal Property Track at CLEF

Patent Terminlogy Analysis: Passage Retrieval Experiments for the Intellecutal Property Track at CLEF Patent Terminlogy Analysis: Passage Retrieval Experiments for the Intellecutal Property Track at CLEF Julia Jürgens, Sebastian Kastner, Christa Womser-Hacker, and Thomas Mandl University of Hildesheim,

More information

XI International PhD Workshop OWD 2009, October Fuzzy Sets as Metasets

XI International PhD Workshop OWD 2009, October Fuzzy Sets as Metasets XI International PhD Workshop OWD 2009, 17 20 October 2009 Fuzzy Sets as Metasets Bartłomiej Starosta, Polsko-Japońska WyŜsza Szkoła Technik Komputerowych (24.01.2008, prof. Witold Kosiński, Polsko-Japońska

More information

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,

More information

Information Extraction Techniques in Terrorism Surveillance

Information Extraction Techniques in Terrorism Surveillance Information Extraction Techniques in Terrorism Surveillance Roman Tekhov Abstract. The article gives a brief overview of what information extraction is and how it might be used for the purposes of counter-terrorism

More information

Programming in C++ Prof. Partha Pratim Das Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Programming in C++ Prof. Partha Pratim Das Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Programming in C++ Prof. Partha Pratim Das Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 08 Constants and Inline Functions Welcome to module 6 of Programming

More information

user.book Page 45 Friday, April 8, :05 AM Part 2 BASIC STRUCTURAL MODELING

user.book Page 45 Friday, April 8, :05 AM Part 2 BASIC STRUCTURAL MODELING user.book Page 45 Friday, April 8, 2005 10:05 AM Part 2 BASIC STRUCTURAL MODELING user.book Page 46 Friday, April 8, 2005 10:05 AM user.book Page 47 Friday, April 8, 2005 10:05 AM Chapter 4 CLASSES In

More information

Web Information Retrieval using WordNet

Web Information Retrieval using WordNet Web Information Retrieval using WordNet Jyotsna Gharat Asst. Professor, Xavier Institute of Engineering, Mumbai, India Jayant Gadge Asst. Professor, Thadomal Shahani Engineering College Mumbai, India ABSTRACT

More information

Efficient subset and superset queries

Efficient subset and superset queries Efficient subset and superset queries Iztok SAVNIK Faculty of Mathematics, Natural Sciences and Information Technologies, University of Primorska, Glagoljaška 8, 5000 Koper, Slovenia Abstract. The paper

More information

Stanford University Computer Science Department Solved CS347 Spring 2001 Mid-term.

Stanford University Computer Science Department Solved CS347 Spring 2001 Mid-term. Stanford University Computer Science Department Solved CS347 Spring 2001 Mid-term. Question 1: (4 points) Shown below is a portion of the positional index in the format term: doc1: position1,position2

More information

Programming Languages Third Edition

Programming Languages Third Edition Programming Languages Third Edition Chapter 12 Formal Semantics Objectives Become familiar with a sample small language for the purpose of semantic specification Understand operational semantics Understand

More information

CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval

CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval DCU @ CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval Walid Magdy, Johannes Leveling, Gareth J.F. Jones Centre for Next Generation Localization School of Computing Dublin City University,

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Ricardo Baeza-Yates Berthier Ribeiro-Neto ACM Press NewYork Harlow, England London New York Boston. San Francisco. Toronto. Sydney Singapore Hong Kong Tokyo Seoul Taipei. New

More information

Full Text Search in Multi-lingual Documents - A Case Study describing Evolution of the Technology At Spectrum Business Support Ltd.

Full Text Search in Multi-lingual Documents - A Case Study describing Evolution of the Technology At Spectrum Business Support Ltd. Full Text Search in Multi-lingual Documents - A Case Study describing Evolution of the Technology At Spectrum Business Support Ltd. This paper was presented at the ICADL conference December 2001 by Spectrum

More information

Using SportDiscus (and Other Databases)

Using SportDiscus (and Other Databases) Using SportDiscus (and Other Databases) Databases are at the heart of research. Google is a database, and it receives almost 6 billion searches every day. Believe it or not, however, there are better databases

More information

Web Information Retrieval. Lecture 4 Dictionaries, Index Compression

Web Information Retrieval. Lecture 4 Dictionaries, Index Compression Web Information Retrieval Lecture 4 Dictionaries, Index Compression Recap: lecture 2,3 Stemming, tokenization etc. Faster postings merges Phrase queries Index construction This lecture Dictionary data

More information

Spatially-Aware Information Retrieval on the Internet

Spatially-Aware Information Retrieval on the Internet Spatially-Aware Information Retrieval on the Internet SPIRIT is funded by EU IST Programme Contract Number: Abstract Multi-Attribute Similarity Ranking Deliverable number: D17:5301 Deliverable type: R

More information