ABSTRACT

VENKATESH, JAYASHREE. Pairwise Document Similarity using an Incremental Approach to TF-IDF. (Under the direction of Dr. Christopher Healey.)

Advances in information and communication technologies offer ubiquitous access to vast amounts of information and are causing an exponential increase in the number of documents available online. While this can be highly beneficial for both humans and automated computer systems, it also means that we must now support real-time operations over large, dynamic sets of documents. While several algorithms for the organization and retrieval of textual information exist, these algorithms require that the document set be available before analysis can begin. TF-IDF is one such algorithm: any change in the document set requires recalculation of the term weights assigned to all terms. We present a methodology to reduce the number of term-weight computations in an environment where documents are continuously being added or removed. We evaluate the approach by measuring its efficiency and accuracy on one of its most important applications, the computation of pairwise document similarity.

© Copyright 2010 by Jayashree Venkatesh. All Rights Reserved.

Pairwise Document Similarity using an Incremental Approach to TF-IDF

by
Jayashree Venkatesh

A thesis submitted to the Graduate Faculty of North Carolina State University in partial fulfillment of the requirements for the Degree of Master of Science

Computer Science

Raleigh, North Carolina
2010

APPROVED BY:

Dr. Robert St. Amant          Dr. Jon Doyle

Dr. Christopher Healey
Chair of Advisory Committee

DEDICATION

To my parents.

BIOGRAPHY

Jayashree Venkatesh was born on August 6th, 1984 in Bangalore, India. She received her Bachelor's degree in Computer Science from Visweswaraiah Technological University in June. She later worked at the Yahoo! Software Development Center, Bangalore, India, mainly on Web 2.0 technology. Since 2008, she has been a Master's student in the Department of Computer Science at North Carolina State University. During the summer and fall of 2009 she worked at Intel Inc., Folsom, California, as a Software Engineering Intern. After completing her Master's program requirements she will join Intel Inc. as a Graphics Software Engineer.

ACKNOWLEDGEMENTS

I would like to express my utmost gratitude to my advisor Dr. Christopher Healey for giving me the opportunity to work under him on my Master's thesis. His insight, guidance, and patience have been most valuable, and I am grateful for the time he spent working with me on the drafts of this report. I would also like to thank him for providing me the opportunity to intern at Intel. My gratitude goes to Dr. Robert St. Amant and Dr. Jon Doyle for serving on my thesis committee and for being exceptionally accommodating. Additionally, I would like to thank Ping-Lin Hsiao, Srinath Setty, and Mohit Khanna for proofreading this document and providing valuable suggestions. Lastly, I would like to thank my roommates Preeti Syal and Krishna Bala Subramani for their constant support and encouragement.

TABLE OF CONTENTS

List of Tables
List of Figures

Chapter 1 Introduction
  1.1 Contributions
  1.2 Thesis Structure

Chapter 2 A Survey of Information Retrieval Systems
  2.1 Basic Algorithmic Operations
    2.1.1 Document Representation Techniques
    2.1.2 Information Retrieval Models
  2.2 Classic Information Retrieval Models
    2.2.1 Boolean (exact-match) Model
    2.2.2 Vector Model
    2.2.3 Probabilistic Model
    2.2.4 Comparison of the Classic Models
  2.3 Alternative Set Theoretic Models
    2.3.1 Extended Boolean Model
    2.3.2 Fuzzy Set Model
  2.4 Alternate Algebraic Models
    2.4.1 Generalized Vector Space Model
    2.4.2 Latent Semantic Indexing Model
    2.4.3 Neural Network Model
  2.5 Alternative Probabilistic Models
    2.5.1 Inference Network Model
  2.6 Structured Document Retrieval Models
    2.6.1 Non-Overlapping Lists Model
    2.6.2 Proximal Nodes Model

Chapter 3 Document Similarity using Incremental TF-IDF
  3.1 Term-Weighting, Vector Space Model and Cosine Similarity
    3.1.1 Choice of Content Terms
    3.1.2 Term Weight Specifications
  3.2 Document Preprocessing
    3.2.1 Lexical Analysis of the Text
    3.2.2 Removal of Stop Words or High Frequency Words
    3.2.3 Suffix Stripping by Stemming
  3.3 Incremental Approach for TF-IDF Recalculation
    3.3.1 Term Weights and TF-IDF Data Structures
    3.3.2 Document Similarity Computation
    3.3.3 Incremental TF-IDF

Chapter 4 Results
  4.1 Handling Dynamic Document Collection
    4.1.1 Addition or Removal of Documents
    4.1.2 Calculating Document Similarity Difference
  4.2 Experimental Results
    4.2.1 Experiment Set 1
    4.2.2 Experiment Set 2
    4.2.3 Experiment Set 3
    4.2.4 Experiment Set 4
  4.3 Conclusion

Chapter 5 Summary and Future Work
  5.1 Challenges and Future Directions

References

LIST OF TABLES

Table 3.1 Stop Words List
Table 4.1 Experimental Results Summary

LIST OF FIGURES

Figure 2.1 Extended Boolean Model: the map at the left shows the similarities of q_or = (k_x ∨ k_y) with documents d_j and d_{j+1}; the map at the right shows the similarities of q_and = (k_x ∧ k_y) with documents d_j and d_{j+1} [6]
Figure 2.2 Neural Network Model [31]
Figure 2.3 Inference Network Model [15]
Figure 2.4 Non-Overlapping Lists Model [6]
Figure 2.5 Proximal Nodes Model [32]
Figure 4.1 Graph of count ratio vs. average and maximum similarity difference for different document count difference thresholds
Figure 4.2 Graph of count ratio vs. average and maximum similarity difference for different percentage error thresholds
Figure 4.3 (1) Count ratio vs. % error threshold; (2) average similarity difference vs. % error threshold; (3) maximum similarity difference vs. % error threshold
Figure 4.4 Graph of count ratio vs. average and maximum similarity difference for different percentage error thresholds
Figure 4.5 (1) Count ratio vs. % error threshold; (2) average similarity difference vs. % error threshold; (3) maximum similarity difference vs. % error threshold
Figure 4.6 Graph of count ratio vs. average and maximum similarity difference for T_2 = 0%

Chapter 1

Introduction

The abundance of information (in digital form) available in online repositories can be highly beneficial for both humans and automated computer systems that seek information. It also poses extremely difficult challenges, however, due to the variety and amount of data available. One of the most challenging analysis problems in the data mining and information retrieval domains is organizing large amounts of information [1]. While several retrieval models have been proposed as the basis for the organization and retrieval of textual information, the Vector Space Model (VSM) has shown the most value due to its efficiency. VSM nevertheless has substantial computational complexity for large sets of documents, and every time the document collection changes, the basic algorithm requires a costly re-evaluation of the entire document set. While in some problem domains it is possible to know the document collection a priori, this is not feasible in many real-time applications with large, dynamic sets of documents.

VSM works by assigning term weights to the terms of the documents. Term weights allow computing a continuous degree of similarity between documents. The best known term weighting scheme uses the term frequency and inverse document frequency of a term and is called the TF-IDF weighting scheme. Term frequency (TF) determines the importance of a term within the document in which it is found.

The more frequently the term appears in the document, the higher the value of its term frequency [2]. Zipf [3] pointed out that a term which appears in many documents in the collection is not useful for distinguishing a relevant document from a non-relevant one. To take this into account, the inverse document frequency (IDF) factor was introduced; it measures the inverse of the relative number of documents that contain the term. A number of variations of TF-IDF exist today, but the underlying principle remains the same [4].

VSM assigns a term weight to every term in a document. Thus a document can be represented as a vector composed of its term weights. Since there are several documents in the collection, the dimensionality of the vector space is equal to the total number of unique terms in the document collection. Pairwise document similarity computation is an important application of VSM that is based on the dot product of document vectors. On each addition or removal of a document, the document vectors change. Specifically:

1. For terms in the document being added or removed, their IDF will change because of changes in both the number of documents containing the term and the total number of documents.

2. For terms not in the document being added or removed, their IDF will also change, because the total number of documents in the collection changes.

This means that with any change in the document set, all the term weights associated with all the documents need to be recalculated. To avoid recalculating all the document vectors on each addition or removal of documents, we propose a mechanism that recalculates only a subset of the term weights, and only when we estimate that the similarity error for deferring the calculation exceeds a predetermined threshold. A sketch of the bookkeeping involved appears below.
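To make this concrete, the following minimal sketch (our illustration, not the thesis implementation; all names are ours) tracks the two quantities a TF-IDF update depends on. Adding a document changes the document frequency df only for that document's terms, while the collection size N changes for every term; the incremental approach of this thesis defers the N-driven recomputation until the estimated similarity error crosses a threshold.

```python
# Minimal sketch (ours) of which TF-IDF values are affected when a
# document is added to a collection.
import math
from collections import Counter

class Corpus:
    def __init__(self):
        self.docs = {}          # doc_id -> Counter of term frequencies
        self.df = Counter()     # term -> number of documents containing it

    def add(self, doc_id, terms):
        tf = Counter(terms)
        self.docs[doc_id] = tf
        for term in tf:
            self.df[term] += 1
        # Exact recomputation would touch every term, because len(self.docs)
        # changed; only these terms had their df change as well.
        return set(tf)

    def idf(self, term):
        return math.log(len(self.docs) / self.df[term])

    def weight(self, doc_id, term):
        tf = self.docs[doc_id]
        return (tf[term] / sum(tf.values())) * self.idf(term)

c = Corpus()
changed = c.add("d1", ["apple", "banana", "apple"])
c.add("d2", ["banana", "cherry"])
print(changed, round(c.weight("d1", "apple"), 3))
```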

1.1 Contributions

The aim of this thesis is to reduce the number of computations made on a real-time, dynamic document collection as documents are added or removed. To achieve this we first develop a working model of the TF-IDF algorithm. Second, we make provisions for dynamic document addition or removal. We then modify the TF-IDF algorithm to recalculate the term weights of only those terms whose document frequency changes on any change in the document set. We describe this as an incremental approach because, unlike the traditional approach which recalculates all the TF-IDF values, recalculation is done only for a smaller set of terms. Following this we conducted a series of experiments to determine how much reduction in the number of computations we achieved. For evaluation we use 1477 articles from various newspaper collections. We conducted our experiments on one of the most important applications of the VSM, pairwise document similarity calculation. We perform several steps of document addition, removal, or addition and removal. After each step we record the number of computations and the pairwise document similarity values as calculated by the incremental approach. We then run through the same set of steps with the traditional approach and once again record the number of computations made, along with the document similarity values. We compare these values to evaluate the performance of the incremental approach against the traditional approach.

1.2 Thesis Structure

In Chapter 2 we give a detailed background on existing document similarity techniques. In Chapter 3 we discuss the design and implementation of the incremental TF-IDF approach. Chapter 4 presents the experimental results comparing the incremental approach to the traditional approach. Finally, Chapter 5 concludes the thesis and highlights directions for future work.

Chapter 2

A Survey of Information Retrieval Systems

The phenomenal growth in the variety and quantity of information available to users has resulted from advances in electronic and computer technology. As a result, users are often faced with the problem of reducing the amount of information to a manageable size, so that only relevant items need be examined. In Alvin Toffler's book Future Shock [5], Emilio Segre, Nobel prize winning physicist, is quoted as saying that on k-mesons alone, to wade through all the papers is an impossibility. This indicates that even in specialized, narrow topics, information is growing enormously. Thus there is a great demand for efficient and effective ways to organize and search through all this information.

Information Retrieval (IR) is concerned with identifying documents in a collection that best match the user's information needs. IR deals with three important concepts: representation of information content, acquisition and representation of the information to be found, and matching functions that retrieve the relevant documents from the information collection. In their book on Modern Information Retrieval [6], Yates and Neto describe IR as a means to represent, organize, store, and access information items.

In their book on Text Information Retrieval Systems [7], Meadow, Boyce and Kraft compare IR to a communication process: a means by which authors or creators of records communicate with readers, indirectly and with a possible time lag between the creation of a text and its delivery to the IR system user.

IR can be subdivided into three main areas of research [8] which make up a considerable portion of the subject: content analysis, information structures, and evaluation. Content analysis is concerned with describing the contents of documents in a form suitable for computer processing. Luhn [2] used frequency counts of words to determine which words in a document should be used to describe the document. Sparck Jones [9] used association of keywords (groups of related words) to derive frequency co-occurrence, which describes the frequency with which words occur together in a document. Several other content analysis techniques are in use today.

Information structures deal with document representations. Most computer-based IR systems store only a representation of a document, by which we mean a list of keywords or terms extracted from the document which are considered important. A retrieval function is executed on this document representation to retrieve the required information; the output is a set of citations or document representations. In some IR systems the user can change his request after a sample retrieval, hoping to improve the subsequent retrieval run. This procedure is commonly referred to as relevance feedback, in which the user can indicate which documents are relevant and which are non-relevant. Organization of files is produced by an automatic classification method; Good [10] and Fairthorne [11] were the pioneers of this automatic representation/classification of information.

Evaluation of IR systems [8] has proved to be extremely difficult. Despite a large amount of work in this area, a general theory of evaluation has not emerged. Lesk and Salton [12] described a dichotomous scale which evaluates IR systems on recall (the portion of relevant documents retrieved) and precision (the portion of retrieved documents which are relevant). Today, evaluation of IR systems is still done using recall and precision as the scale.

2.1 Basic Algorithmic Operations

Algorithmic issues arise in two aspects of IR systems: (1) representing objects (text/image/multimedia) in a form amenable to automated search, and (2) efficiently searching such representations [13]. First we shall focus on the various representations used for documents and information needs. We will then discuss the classic retrieval models for information retrieval.

2.1.1 Document Representation Techniques

Three classic ideas pervade information retrieval systems for efficient document representation: indexing, negative dictionaries (also known as stop word lists), and stemming [13]. A collection of documents is commonly referred to as a corpus.

Indexing deals with storing the subsets of documents associated with different terms in the corpus. A simple query returns all documents which contain any of the query terms. However, this approach leads to poor precision, since a user generally requires a Boolean AND of the search terms, not a Boolean OR. To solve this issue we could retrieve, for each query term, the documents which contain it, and take the intersection of these sets of documents. This approach, however, processes many more documents than are returned as output. Hence it is desirable for an efficient IR system to return a list of documents ranked according to some scheme based on the number of query terms each document contains. This, however, falls within the scope of retrieval models and will be discussed in the next section. A better indexing mechanism is to store the position of each occurrence of a term in a document, along with the documents which contain the term. This is helpful for queries dealing with a particular occurrence of the query terms. Thus the indexing algorithm should be capable of supporting complex queries such as string queries. This suggests that it is necessary for the algorithm to understand the underlying corpus. Techniques that exploit the term statistics in the corpus were thus designed [13], as in the sketch below.
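As a concrete illustration of indexing with positions (our sketch, not from the thesis; all names are ours), the following positional inverted index supports both Boolean AND queries by posting-list intersection and, because positions are stored, could be extended to phrase or proximity queries.

```python
# Minimal positional inverted index sketch (illustrative only).
from collections import defaultdict

index = defaultdict(list)   # term -> list of (doc_id, position)

def add_document(doc_id, text):
    for pos, term in enumerate(text.lower().split()):
        index[term].append((doc_id, pos))

def boolean_and(*terms):
    # Documents containing every query term: intersect the posting lists.
    postings = [{doc for doc, _ in index[t]} for t in terms]
    return set.intersection(*postings) if postings else set()

add_document(1, "information retrieval systems")
add_document(2, "retrieval of textual information")
print(boolean_and("information", "retrieval"))   # {1, 2}
```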

The first step in this direction was the use of negative dictionaries (or stop word lists). A stop word list is a list of words that occur so commonly in the corpus that using them as index terms is not a good idea: a query containing a term from the stop word list would fetch almost all documents in the corpus. Prepositions and articles are commonly included in negative dictionaries. This technique has certain trade-offs, however, because it becomes difficult to search for strings that contain only prepositions and articles. Also, the contents of a negative dictionary should be designed with the corpus in mind: for instance, the word can is generally considered a stop word, but in a corpus on waste management and recycling it might be an important index term.

Another important technique to reduce the number of index terms is the use of a stemming algorithm. This approach reduces search and index terms to their etymological roots. For example, a search for educational could return all documents containing the term education. A stop-word filter and a crude suffix stripper are sketched below.
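The following sketch (ours, with an illustrative stop word list and suffix table; production systems would use something like the Porter stemmer) combines the two preprocessing steps just described.

```python
# Illustrative stop-word removal and naive suffix stripping.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "is"}
SUFFIXES = ("ational", "tional", "ing", "al", "s")   # checked longest first

def stem(term):
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[:-len(suffix)]
    return term

def preprocess(text):
    terms = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [stem(t) for t in terms]

print(preprocess("The educational value of stemming"))
# ['educ', 'value', 'stemm'] -- crude, but maps related word forms together
```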

Another approach is to determine the lexical relationship between terms in the query and the documents in which they occur [14]. If two terms appear adjacent to each other in a query, and some documents contain these two terms close to each other (say within a distance of five to eight words), then the IR system could rank these documents higher than the others. For instance, a search for the term computer networks could first return those documents which contain these two terms close to each other.

Precision in text retrieval can be improved by using an approach called categorized search. In this approach each document is assigned to one or more categories. For instance, tags such as Arts, Business and Economy, Government, Education, and Science and Technology could be assigned to documents, indicating the major category to which they belong. Each of these categories is further subdivided into multiple sub-categories. After assigning a document to its broader category and sub-categories, the very terms that indicated the document as belonging to a category or sub-category can now function as stop words. For example, if the term painting was used to categorize a document as belonging to the category Art, then after categorization the term paint/painting can be removed from the entire document as a stop word. The choice of categories to improve the relevance of a document is an important task. Raghavan [13] discusses three important steps to consider before categorizing a corpus. First, the choice of categories should be intuitive to the anticipated user population. Second, the choice of categories should lead to a balanced taxonomy: a small number of categories containing all the documents is not recommended. Third, the choice of categories should span the corpus. Choosing two categories that are very similar to each other is not desirable and could confuse the user. It is also important to realize that categorization of documents into clusters is not a static process. In a corpus containing, say, news articles, the clusters may change as the focus of the news changes. Thus categorization should be dynamic and capable of supporting rapid change in the documents contained in each cluster.

Finally, we return to the idea of relevance feedback discussed at the beginning of this chapter. The IR system returns as output a set of documents based on the user's query. The user then marks each of these documents as relevant or irrelevant. On receiving this input, the system refines the query to fetch results closer to what the user is seeking. This process leads to an increase in precision and recall.

2.1.2 Information Retrieval Models

While information can be of several types (text, multimedia, images, etc.), most of the information sought by the end user is in textual form. Several retrieval models have been proposed as the basis for text retrieval systems. Three classic models in IR, however, are used widely: the exact-match (Boolean) model, the vector space model, and the probabilistic model. Experiments have shown that the differences in these approaches can be explained as differences in the estimation of probabilities, both in the initial search and during relevance feedback [15].

The exact-match or Boolean model views documents and queries as sets of index terms; thus, as suggested in [16], this retrieval model is called set theoretic.

The vector space model views documents and queries as vectors in a high-dimensional vector space, and uses distance as a measure of similarity; thus this model is said to be algebraic. In the probabilistic model, retrieval is viewed as the problem of estimating the probability that a document representation matches or satisfies a query; as the name indicates, this model is said to be probabilistic.

Several alternate modeling paradigms based on the classic models have been proposed in recent years. Fuzzy and extended Boolean models have been developed as alternatives to the basic Boolean approach; generalized vector, latent semantic indexing, and neural network models as alternatives to the algebraic models; and inference networks and belief networks as alternatives to the probabilistic models [6]. While all these models deal with the text content found in documents, several approaches dealing with the structure of the written text have also been proposed. These models are termed structured models, and two popular models in this category are the non-overlapping lists model and the proximal nodes model.

Yates and Neto [6] also discuss models for browsing in their book on Modern Information Retrieval. They suggest that users of an information system might be engaged in one of two tasks: retrieval or browsing. The task of translating the user's information need into a query in the language provided by the system is known as retrieval; all the models discussed earlier in this section are models for retrieval. Suppose instead that the user has an interest which is poorly defined or inherently broad; for example, the user is interested in documents about the solar system. In this case the user interacts with the IR system, looking around the documents on the solar system. While looking through these documents the user may find interesting documents about galaxies, black holes, asteroids in the solar system, or even the planet Earth. Furthermore, while reading about the planet Earth he may wander off into documents on the greenhouse effect, pollution on the Earth, and even documents pertaining to pollution control organizations. The user is then said to be browsing the collection of documents. While this is also a process of information retrieval, the objectives of the task are not clearly defined at the beginning, and the purpose of the task may keep changing during the interaction with the system. Several models for browsing also exist; they are not discussed in this chapter.

2.2 Classic Information Retrieval Models

2.2.1 Boolean (exact-match) Model

The Boolean model is a simple retrieval model based on set theory and Boolean algebra. In the Boolean model, a set of binary-valued variables refers to the features that can be assigned to documents. A document is an assignment of truth values to this set of feature variables: features which are correct descriptions of the document content are assigned true, and all other features are assigned false. As a result, the weights assigned to the feature variables are all binary, i.e., the weight w_{i,j} associated with the pair (k_i, d_j) is either 0 or 1. Here k_i is the i-th feature variable and d_j corresponds to the j-th document. In this model queries are specified as Boolean expressions involving the operators and, or, and not. Any document whose truth value assignment matches the query expression is said to match the query; all others fail to match. The Boolean model predicts that each document is either relevant or irrelevant; there is no notion of a partial match to the query conditions. A minimal sketch of this matching appears below.

The main advantages of the Boolean model are the clean formalism behind the model and its simplicity. The major drawback of this model is that its retrieval strategy is based on a binary decision criterion: a document is predicted to be either relevant or non-relevant. This notion of exact matching may lead to retrieval of too many or too few documents. Another drawback is that it is not always simple to translate an information need into a Boolean expression. Despite these drawbacks, the Boolean model is still the dominant model in commercial document database systems and provides a good starting point for those new to the field.
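As an illustration (our sketch; documents and query are invented), exact-match retrieval reduces to evaluating a Boolean expression over each document's feature set. Every document is either in or out; there is no ranking.

```python
# Illustrative Boolean (exact-match) retrieval: each document is the
# set of features assigned "true".
docs = {
    "d1": {"information", "retrieval", "boolean"},
    "d2": {"information", "vector", "model"},
}

# Query: information AND (boolean OR vector) AND NOT probabilistic
def matches(features):
    return ("information" in features
            and ("boolean" in features or "vector" in features)
            and "probabilistic" not in features)

relevant = [d for d, features in docs.items() if matches(features)]
print(relevant)   # ['d1', 'd2'] -- a binary decision, no partial match
```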

2.2.2 Vector Model

The vector model [16, 17] recognizes that the binary weighting scheme is too limiting, and hence provides a framework in which partial matching is possible. The vector model works by assigning non-binary weights to the index terms in queries and in documents. These index term weights are then used to compute the degree of similarity between the documents in the corpus, and also between the documents and the query. The similarity values are used to sort the retrieved documents in decreasing order of their degree of similarity. Thus the vector model takes into consideration documents which match the query terms only partially, and the documents retrieved as output match the user's query more precisely than the documents retrieved by the Boolean model.

In the vector model, both the j-th document d_j and the query q are represented as multidimensional vectors. The vector model evaluates the similarity between the document d_j and the query q. The correlation between the two vectors is quantified as the cosine of the angle between them. Thus the similarity between the document vector and the query is given as:

$$\mathrm{sim}(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}|\,|\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2}\;\sqrt{\sum_{i=1}^{t} w_{i,q}^2}} \qquad (2.1)$$

where |d_j| and |q| are the lengths of the vectors and t is the number of dimensions. While the query remains the same, the document space varies, and hence normalization is essential. w_{i,j} is the weight associated with the pair (k_i, d_j), as in the Boolean model, except that it is now a positive non-binary value. The index terms in the query are also weighted: w_{i,q} is the positive weight associated with the pair (k_i, q), where k_i is the i-th index term in the query q.

Thus in this model both the documents and the query are represented as vectors with t index terms. The document vector is represented by d_j = (w_{1,j}, w_{2,j}, ..., w_{t,j}) and the query vector by q = (w_{1,q}, w_{2,q}, ..., w_{t,q}). sim(d_j, q) varies from 0 to +1, since both w_{i,j} and w_{i,q} are positive. The vector model ranks the documents according to their degree of similarity to the query instead of predicting whether the documents are relevant or not. Thus a document is retrieved even if it only partially matches a query. To prevent a large number of documents from being retrieved, the user can specify a similarity threshold; only those documents with a similarity value above the threshold will be retrieved.

Index term weights for the vector space model can be calculated in many different ways. However, the most effective term weighting approaches use a clustering technique, as described by Salton, Wong and Yang [18]. Thus the IR problem can be viewed as one of clustering. Given a collection of objects C and a user query specifying a set A of objects, the clustering approach deals with classifying the objects in the collection C as belonging to the set A or not. The clustering approach deals with two issues: first, identifying what features better describe the objects in the set A; second, identifying what features distinguish the objects in the set A from those not in the set A. Thus the two issues are to determine the intra-cluster similarity and the inter-cluster dissimilarity [6].

According to Salton and McGill [19], intra-cluster similarity can be measured by the frequency of a term k_i in the document d_j. This is referred to as the term frequency (tf) and is a measure of how well the term describes the document's contents. For inter-cluster dissimilarity, Salton and McGill [19] specify an inverse document frequency measure (idf). The idf factor reflects that terms which appear in many documents are not very useful for distinguishing a relevant document from a non-relevant one. An effective term weighting scheme tries to balance both these effects.

Let D be the total number of documents in the collection, and let |{d : k_i ∈ d}| be the number of documents in which the term k_i appears. Let freq_{i,j} be the frequency of occurrence of the term k_i in the document d_j (i.e., the number of times the term k_i is mentioned in the document d_j).

Then the normalized term frequency for the term k_i in the document d_j is given by

$$tf_{i,j} = \frac{freq_{i,j}}{\sum_{p=1}^{t} freq_{p,j}} \qquad (2.2)$$

where the denominator is the sum of the number of occurrences of all the terms in the document d_j. If the term k_i does not appear in the document d_j, then tf_{i,j} = 0. The inverse document frequency for the term k_i is given by

$$idf_i = \log \frac{D}{|\{d : k_i \in d\}|} \qquad (2.3)$$

The best known term weighting schemes use weights which combine both the term frequency and the inverse document frequency:

$$w_{i,j} = tf_{i,j} \times idf_i \qquad (2.4)$$

This term weighting scheme is called the TF-IDF weighting scheme. Salton and Buckley [20] suggest several variations of the weight w_{i,j} using the same underlying TF-IDF principle; however, the above scheme is effective for most document collections.

The vector space model has the following advantages: (1) improved performance through an efficient term weighting scheme; (2) retrieval of documents which partially match a query instead of requiring a total match; (3) cosine similarity sorts documents according to their degree of similarity with the query, allowing users to choose documents above some threshold level of similarity. The main disadvantage of the vector space model is that it assumes the terms are all mutually independent; the lexical/term relationship between documents is not considered. However, the benefit of determining lexical relationships among terms is very collection specific, and hence might not influence the algorithm's performance in all scenarios. The vector space model is thus a simple, efficient, and highly popular retrieval model. The sketch below assembles Equations (2.1) through (2.4) into a small similarity function.
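The following compact sketch (ours; function names and sample data are illustrative) implements Equations (2.2) through (2.4) and the cosine similarity of Equation (2.1), applied here to a pair of documents rather than a document and a query.

```python
# TF-IDF weights (Eqs. 2.2-2.4) and cosine similarity (Eq. 2.1).
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc."""
    D = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        freq = Counter(doc)
        total = sum(freq.values())          # normalizing denominator of Eq. 2.2
        vectors.append({t: (f / total) * math.log(D / df[t])
                        for t, f in freq.items()})
    return vectors

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

docs = [["apple", "banana", "apple"], ["banana", "cherry"], ["apple", "cherry"]]
vecs = tfidf_vectors(docs)
print(round(cosine(vecs[0], vecs[2]), 3))   # pairwise document similarity
```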

2.2.3 Probabilistic Model

The probabilistic model, introduced by Robertson and Sparck Jones [21], is also known as the binary independence retrieval (BIR) model. The fundamental idea of this model is as follows. For a given user query, there is a set of documents which contains exactly the relevant documents; this set is known as the ideal answer set. If the description of this ideal answer set were known, there would be no problem retrieving the relevant documents. Generally, however, the properties of this ideal answer set are not known exactly. Since the properties are unknown, an initial guess is made at query time as to what these properties could be. Through this probabilistic description of the ideal answer set, a first set of documents is retrieved. The user then, through relevance feedback, looks at the retrieved documents and decides which are relevant and which are not. This helps further improve the probabilistic description of the ideal answer set.

Given a query q and a document d_j, the probabilistic model tries to estimate the probability that the user will find the document d_j relevant. The model assumes that there is a subset of documents which the user prefers as the answer set to the query q; call this ideal answer set R. The probabilistic model assigns to each document d_j, as its similarity to the query q, the ratio P(d_j relevant to q) / P(d_j non-relevant to q). This computes the odds that the document d_j is relevant to the query q. As in the Boolean model, the index term weight variables for the probabilistic model are binary, i.e., w_{i,j} ∈ {0, 1} and w_{i,q} ∈ {0, 1}. The query q is a subset of the index terms. R is the set of relevant documents and R̄ the set of non-relevant documents. Let P(R | d_j) be the probability that the document d_j is relevant to the query q, and let P(R̄ | d_j) be the probability that the document d_j is non-relevant to the query q. The similarity measure of the document d_j with the query q is defined as

$$\mathrm{sim}(d_j, q) = \frac{P(R \mid \vec{d_j})}{P(\bar{R} \mid \vec{d_j})} \qquad (2.5)$$

Using Bayes' rule,

$$\mathrm{sim}(d_j, q) = \frac{P(\vec{d_j} \mid R)\, P(R)}{P(\vec{d_j} \mid \bar{R})\, P(\bar{R})} \qquad (2.6)$$

where P(d_j | R) stands for the probability of randomly selecting the document d_j from the set of relevant documents R, and P(R) is the probability that a document randomly selected from the entire collection is relevant. The meanings of P(d_j | R̄) and P(R̄) are analogous and complementary. Since P(R) and P(R̄) are the same for all documents in the collection, we can write

$$\mathrm{sim}(d_j, q) \sim \frac{P(\vec{d_j} \mid R)}{P(\vec{d_j} \mid \bar{R})} \qquad (2.7)$$

If we assume the independence of index terms (i.e., index terms are not related to each other), then we can rewrite Equation (2.7) as

$$\mathrm{sim}(d_j, q) \sim \frac{\left(\prod_{g_i(\vec{d_j})=1} P(k_i \mid R)\right)\left(\prod_{g_i(\vec{d_j})=0} P(\bar{k_i} \mid R)\right)}{\left(\prod_{g_i(\vec{d_j})=1} P(k_i \mid \bar{R})\right)\left(\prod_{g_i(\vec{d_j})=0} P(\bar{k_i} \mid \bar{R})\right)} \qquad (2.8)$$

where P(k_i | R) stands for the probability that the index term k_i is present in a document randomly selected from the set R, and P(k̄_i | R) is the probability that the index term k_i is not present in a document randomly selected from the set R. The meanings of P(k_i | R̄) and P(k̄_i | R̄) are analogous and complementary.

Initially we do not know the set R of relevant documents, so it is necessary to estimate the probabilities P(k_i | R) and P(k_i | R̄). We make the following assumptions: (1) P(k_i | R) is constant for all index terms (equal to 0.5); (2) the distribution of the index terms among the non-relevant documents can be approximated by the distribution of index terms among all the documents in the collection. Thus we have:

$$P(k_i \mid R) = 0.5 \qquad (2.9)$$

$$P(k_i \mid \bar{R}) = \frac{n_i}{N} \qquad (2.10)$$

where n_i is the number of documents which contain the index term k_i and N is the total number of documents in the collection. Using Equations (2.9) and (2.10) we can retrieve an initial set of documents containing the query terms and provide an initial probabilistic ranking for them. From here we improve the ranking as follows. Let V be the number of documents initially retrieved and ranked by the probabilistic model, and let V_i be the subset of V containing the index term k_i. To improve the probabilistic ranking we need better estimates of P(k_i | R) and P(k_i | R̄). We now make the following assumptions: (1) P(k_i | R) can be approximated by the distribution of the index term k_i among the documents retrieved so far; (2) P(k_i | R̄) can be approximated by considering all non-retrieved documents to be non-relevant. We can then write

$$P(k_i \mid R) = \frac{V_i}{V} \qquad (2.11)$$

$$P(k_i \mid \bar{R}) = \frac{n_i - V_i}{N - V} \qquad (2.12)$$

This process can be repeated recursively, improving the estimates of P(k_i | R) and P(k_i | R̄); a sketch of one such iteration appears below. The main advantage of the probabilistic model is that documents are ranked in decreasing order of their probability of being relevant. The disadvantages of the approach are: (1) the initial step of guessing the separation of documents into relevant and non-relevant sets; (2) the model does not take into account either the term frequency of an index term within a document or its inverse document frequency within the collection; (3) the model assumes the index terms are independent, with no relationships between them.
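The sketch below (ours, not the thesis code) ranks documents by the product of Equation (2.8), using the initial estimates of Equations (2.9) and (2.10) on the first pass and the feedback estimates of Equations (2.11) and (2.12) afterwards. Two liberties are ours, not the text's: the product is restricted to query terms (assuming non-query terms contribute equal factors above and below), and a +0.5 smoothing is added to avoid division by zero on small collections.

```python
# Illustrative iteration of the BIR model, Equations (2.8)-(2.12).
def bir_rank(docs, query_terms, retrieved=None):
    """docs: dict doc_id -> set of terms. Returns doc_ids ranked by Eq. (2.8)."""
    N = len(docs)
    n = {k: sum(k in d for d in docs.values()) for k in query_terms}
    if retrieved is None:                    # first pass: Eqs. (2.9), (2.10)
        p_r = {k: 0.5 for k in query_terms}
        p_nr = {k: n[k] / N for k in query_terms}
    else:                                    # feedback pass: Eqs. (2.11), (2.12)
        V = len(retrieved)
        V_i = {k: sum(k in docs[d] for d in retrieved) for k in query_terms}
        p_r = {k: (V_i[k] + 0.5) / (V + 1) for k in query_terms}        # smoothed
        p_nr = {k: (n[k] - V_i[k] + 0.5) / (N - V + 1) for k in query_terms}

    def score(terms):
        s = 1.0
        for k in query_terms:
            # present terms use P(k|R)/P(k|~R); absent terms the complements
            pr, pn = (p_r[k], p_nr[k]) if k in terms else (1 - p_r[k], 1 - p_nr[k])
            s *= pr / pn
        return s

    return sorted(docs, key=lambda d: score(docs[d]), reverse=True)
```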

2.2.4 Comparison of the Classic Models

From our discussion in the previous sections it is clear that the Boolean model is the weakest of the classic models, its main disadvantage being the inability to recognize partial matches, which leads to poor performance. Experiments performed by Croft [15] suggest that the probabilistic model provides better performance than the vector model. However, later experiments by Salton and Buckley [20] showed through several different measures that the vector model outperforms the probabilistic model on general collections. Thus the vector model is the most popular model among researchers, practitioners, and the web community.

2.3 Alternative Set Theoretic Models

Two alternate set theoretic models are popular: the fuzzy set model and the extended Boolean model. In this section we discuss these two models in brief.

2.3.1 Extended Boolean Model

The extended Boolean model first appeared in a 1983 Communications of the ACM article by Salton, Fox and Wu [22]. In the Boolean model, for a query of the form q = k_x ∧ k_y, only a document containing both the index terms k_x and k_y is retrieved; there is no difference between a document which contains either the term k_x or the term k_y and one that contains neither of them. The extended Boolean model, however, allows us to handle partially matching documents, just as the vector space model does. It combines the vector space model and Boolean algebra to calculate the similarities between queries and documents.

Consider the scenario where only two terms (k_x and k_y) are present in the query. We can now map the documents and queries into a two-dimensional space as shown in Figure 2.1. Weights w_1 and w_2 are computed for the terms k_x and k_y respectively in the document d_j. The weights can be computed using the tf-idf factors of the vector space model as follows.

Figure 2.1: Extended Boolean Model. The map at the left shows the similarities of q_or = (k_x ∨ k_y) with documents d_j and d_{j+1}; the map at the right shows the similarities of q_and = (k_x ∧ k_y) with documents d_j and d_{j+1}. [6]

$$w_1 = tf_{x,j} \cdot \frac{idf_x}{\max_i idf_i} \qquad (2.13)$$

where tf_{x,j} is the (normalized) term frequency for the term k_x in the document d_j, idf_x is the inverse document frequency for the term k_x in the entire collection, and idf_i is the inverse document frequency for a generic term k_i. The weight w_2 for the term k_y is calculated similarly. From the map shown at the left in Figure 2.1, we see that for the query q_or = (k_x ∨ k_y) the point (0,0) is the spot to be avoided, and from the map at the right, for the query q_and = (k_x ∧ k_y) the spot (1,1) is the most desirable. This suggests that for the query q_or we take the distance from the spot (0,0) as a measure of similarity, and for the query q_and the complement of the distance from the spot (1,1). Thus we arrive at the formulas:

$$\mathrm{sim}(q_{or}, d_j) = \sqrt{\frac{w_1^2 + w_2^2}{2}} \qquad (2.14)$$

$$\mathrm{sim}(q_{and}, d_j) = 1 - \sqrt{\frac{(1 - w_1)^2 + (1 - w_2)^2}{2}} \qquad (2.15)$$

The 2D extended Boolean model discussed above can easily be generalized, using Euclidean distances, to a document collection in a higher, t-dimensional space. The p-norm model goes further, using not only Euclidean distances but general p-distances, where 1 ≤ p ≤ ∞. A generalized disjunctive query is given by:

$$q_{or} = k_1 \vee^p k_2 \vee^p \cdots \vee^p k_m \qquad (2.16)$$

The similarity between q_or and d_j is given by:

$$\mathrm{sim}(q_{or}, d_j) = \left( \frac{w_{1,j}^p + w_{2,j}^p + \cdots + w_{t,j}^p}{t} \right)^{1/p} \qquad (2.17)$$

A generalized conjunctive query is given by:

$$q_{and} = k_1 \wedge^p k_2 \wedge^p \cdots \wedge^p k_m \qquad (2.18)$$

The similarity between q_and and d_j is given by:

$$\mathrm{sim}(q_{and}, d_j) = 1 - \left( \frac{(1 - w_{1,j})^p + (1 - w_{2,j})^p + \cdots + (1 - w_{t,j})^p}{t} \right)^{1/p} \qquad (2.19)$$

More general queries such as q = (k_1 ∧^p k_2) ∨^p k_3 can easily be processed by grouping the operators in a predefined order. The parameter p can be varied between 1 and infinity to vary the p-norm ranking behavior from vector-based ranking to Boolean-like (fuzzy logic) ranking. The extended Boolean model is thus quite a powerful model for information retrieval. Though it has not been used extensively, it may yet prove itself useful. A small sketch of the p-norm similarities follows.
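This sketch (ours; weights are assumed to lie in [0, 1] as in Figure 2.1) implements Equations (2.17) and (2.19) and illustrates the limiting behavior: at p = 2 it reproduces the Euclidean formulas (2.14) and (2.15), and as p grows, sim(q_or) approaches max(w) and sim(q_and) approaches min(w), i.e. Boolean-like ranking.

```python
# p-norm similarities of Equations (2.17) and (2.19).
def sim_or(weights, p):
    t = len(weights)
    return (sum(w ** p for w in weights) / t) ** (1 / p)

def sim_and(weights, p):
    t = len(weights)
    return 1 - (sum((1 - w) ** p for w in weights) / t) ** (1 / p)

w = [0.9, 0.1]
print(sim_or(w, 2), sim_and(w, 2))       # Euclidean case of Figure 2.1
print(sim_or(w, 50), sim_and(w, 50))     # near max(w) and min(w): Boolean-like
```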

2.3.2 Fuzzy Set Model

The fuzzy retrieval model is based on the extended Boolean model and fuzzy set theory. Let us first review the basic concepts of fuzzy set theory. Fuzzy sets are those whose elements have varying degrees of membership. In classical set theory, membership of an element is assessed in binary terms: an element either belongs or does not belong to the set. This is called a crisply defined set, with every element holding the value of either 0 or 1. Fuzzy set theory allows gradual (rather than abrupt) assessment of the membership of elements in a set. A fuzzy set is described with the aid of a membership function valued in the real interval [0, 1]. A fuzzy set is thus a pair {A, m} where A is a set and m : A → [0, 1]. For each x ∈ A, m(x) is called the grade of membership of x in {A, m}. Let x ∈ A. Then x is not included in the fuzzy set {A, m} if m(x) = 0, x is said to be fully included if m(x) = 1, and x is called a fuzzy member if 0 < m(x) < 1. The set {x ∈ A | m(x) > 0} is called the support of {A, m}, and the set {x ∈ A | m(x) = 1} is called its kernel.

There are two classical fuzzy retrieval models: the Mixed Min and Max (MMM) model and the Paice model.

Mixed Min and Max Model (MMM)

In the MMM model, each index term has a fuzzy set associated with it. A document's weight with respect to an index term A is the degree of membership of the document in the fuzzy set associated with A. Documents to be retrieved for a query of the form {A or B} should be in the fuzzy set associated with the union of the two sets A and B; similarly, documents to be retrieved for a query of the form {A and B} should be in the fuzzy set associated with the intersection of these two sets. Thus the similarity of a document to the or query is max(w_A, w_B) and the similarity to the and query is min(w_A, w_B). The MMM model tries to soften the Boolean operators by considering the query-document similarity to be a linear combination of the min and max document weights. Thus, given a document d_j and index term weights w_1, w_2, ..., w_t for terms k_1, k_2, ..., k_t, and the queries:

$$q_{or} = (k_1 \text{ or } k_2 \text{ or } \ldots \text{ or } k_t) \qquad (2.20)$$

$$q_{and} = (k_1 \text{ and } k_2 \text{ and } \ldots \text{ and } k_t) \qquad (2.21)$$

the MMM model computes the query-document similarity as follows:

$$\mathrm{sim}(q_{or}, d_j) = C_{or_1} \cdot \max(w_1, w_2, \ldots, w_t) + C_{or_2} \cdot \min(w_1, w_2, \ldots, w_t) \qquad (2.22)$$

$$\mathrm{sim}(q_{and}, d_j) = C_{and_1} \cdot \min(w_1, w_2, \ldots, w_t) + C_{and_2} \cdot \max(w_1, w_2, \ldots, w_t) \qquad (2.23)$$

where C_{or_1} and C_{or_2} are the softness coefficients for the or operator, and C_{and_1} and C_{and_2} are the softness coefficients for the and operator [23]. For an or query we would like to give more importance to the maximum of the term weights, and for an and query more importance to the minimum, so we require C_{or_1} > C_{or_2} and C_{and_1} > C_{and_2}. For simplicity we generally assume C_{or_1} = 1 − C_{or_2} and C_{and_1} = 1 − C_{and_2}. Experiments conducted by Lee and Fox [24] show that the best performance of the MMM model occurs with C_{and_1} in the range [0.5, 0.8] and with C_{or_1} > 0.2. The computational cost of the MMM model is generally low, and its retrieval effectiveness is generally better than that of the standard Boolean model.

Paice Model

The Paice model [25] is an extension of the MMM model. The MMM model considers only the maximum and minimum term weights, while the Paice model incorporates all the term weights when calculating the similarity. Thus,

$$\mathrm{sim}(q, d_j) = \frac{\sum_{i=1}^{t} r^{i-1} w_i}{\sum_{i=1}^{t} r^{i-1}} \qquad (2.24)$$

where r is a constant coefficient and the w_i are the term weights, arranged in ascending order for and queries and in descending order for or queries. When t = 2, the Paice model shows the same behavior as the MMM model. Experiments by Lee and Fox [24] show that setting r = 0.1 for and queries and r = 0.7 for or queries gives good retrieval effectiveness. However, this method is more expensive than the MMM model, because the term weights must be sorted in ascending or descending order depending on whether an and clause or an or clause is being considered. The MMM model only requires determining the min or max of a set of term weights, which can be done in O(t); the Paice model requires at least O(t log t) for the sorting, along with more floating point calculations. Both similarity functions are sketched below.

Fuzzy set models have mainly been discussed in the literature dedicated to fuzzy theory and are not very popular in the information retrieval community. Also, the majority of the experiments carried out have considered only small collections, which makes comparison difficult at this time.
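The following sketch (ours; the default coefficient values follow the Lee and Fox [24] recommendations quoted above) implements Equations (2.22) through (2.24) side by side.

```python
# Illustrative MMM (Eqs. 2.22-2.23) and Paice (Eq. 2.24) similarities.
def mmm(weights, is_and, c=0.7):
    # c plays the role of C_and1 (or C_or1); its partner coefficient is 1 - c.
    lo, hi = min(weights), max(weights)
    return c * lo + (1 - c) * hi if is_and else c * hi + (1 - c) * lo

def paice(weights, is_and, r=None):
    r = (0.1 if is_and else 0.7) if r is None else r
    w = sorted(weights, reverse=not is_and)   # ascending for "and" queries
    num = sum(r ** i * wi for i, wi in enumerate(w))
    den = sum(r ** i for i in range(len(w)))
    return num / den

w = [0.2, 0.9, 0.5]
print(mmm(w, is_and=True), paice(w, is_and=True))
```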

2.4 Alternate Algebraic Models

Three alternate algebraic models are discussed in this section: the generalized vector space model, the latent semantic indexing model, and the neural network model.

2.4.1 Generalized Vector Space Model

In the conventional vector space model (VSM) proposed by Salton [16, 19], the index terms are basis vectors of a vector space, and each query is represented as a linear combination of these vectors [26]. The IR retrieval process uses both the query vectors and the document vectors to compute a cosine similarity that ranks the documents according to their degree of similarity with the query. The term frequencies of the terms in a document are used as the components of the document vector. This model assumes that the term vectors are orthogonal, i.e., for each pair of index terms k_i and k_j we have k_i · k_j = 0. However, the terms in a document collection are generally correlated, and an efficient IR model takes these term correlations into consideration. This observation led to the development of the generalized vector space model (GVSM), in which the term vectors may be correlated and hence non-orthogonal.

In GVSM, queries are presented as lists of terms with their corresponding weights. Thus GVSM cannot directly handle Boolean queries (of the form AND, OR, or NOT). However, Wong, Ziarko, Raghavan and Wong [26] show that GVSM can be extended to handle situations where Boolean expressions are used as queries. Let (k_1, k_2, ..., k_t) be the set of index terms in a document collection, and let B_{2^t} be the set of all possible Boolean expressions (also the number of possible patterns of term co-occurrence) using these index terms and the operators AND, OR, and NOT. To represent every possible Boolean expression in B_{2^t} as a vector in a vector space, we need a set of basis vectors corresponding to a set of fundamental expressions which can be combined to generate any element of the Boolean algebra. This leads to the notion of an atomic expression, or minterm. A minterm in t literals (k_1, k_2, ..., k_t) is a conjunction of literals in which each term k_i appears exactly once, in either its complemented or uncomplemented form. Thus in all there are 2^t minterms. The conjunction of any two distinct minterms is always false (zero), and any Boolean expression involving (k_1, k_2, ..., k_t) can be expressed as a disjunction of minterms. Let us define the following set of m_i vectors:

$$\vec{m}_1 = (1, 0, \ldots, 0, 0) \qquad (2.25)$$
$$\vec{m}_2 = (0, 1, \ldots, 0, 0) \qquad (2.26)$$
$$\vdots \qquad (2.27)$$
$$\vec{m}_{2^t} = (0, 0, \ldots, 0, 1) \qquad (2.28)$$

where each vector m_i is associated with the respective minterm m_i. Given these basis vectors, the vector representation of any Boolean expression is the vector sum of the basis vectors of its minterms. Notice that for all i ≠ j, m_i · m_j = 0; thus the set of m_i vectors is pairwise orthonormal. If two vectors are not orthogonal, then their corresponding Boolean expressions must have at least one minterm in common.

To determine an expression for the index term vector k_i associated with the index term k_i, note that each term k_i is an element of the Boolean algebra generated, and can be expressed in disjunctive normal form (a disjunction of minterms) as:

$$k_i = m_{i_1} \text{ OR } m_{i_2} \ldots \text{ OR } m_{i_p} \qquad (2.29)$$

where the m_{i_j} are the minterms in which k_i appears uncomplemented and 1 ≤ j ≤ 2^t. If we denote the set of these minterms by m_i, we can define the term vector k_i as

$$\vec{k}_i = \sum_{\vec{m}_r \in m_i} \vec{m}_r \qquad (2.30)$$

or,

$$\vec{k}_i = \sum_{r=1}^{2^t} c_{i,r}\, \vec{m}_r \qquad (2.31)$$

A small sketch of the minterm construction follows.
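To make the minterm machinery concrete, the sketch below (ours; names are hypothetical, and it stops at the construction described in the text, before the coefficients c_{i,r} are assigned) enumerates the 2^t minterms and identifies the one matching a document's term-occurrence pattern.

```python
# Illustrative minterm enumeration for GVSM.
# A minterm over t terms is a truth assignment; a document's term
# co-occurrence pattern selects exactly one minterm.
from itertools import product

t = 3
minterms = list(product([0, 1], repeat=t))      # all 2^t atomic expressions

def minterm_of(doc_terms, vocabulary):
    # The single minterm matching a document's term-occurrence pattern.
    return tuple(int(k in doc_terms) for k in vocabulary)

vocab = ["k1", "k2", "k3"]
print(len(minterms), minterm_of({"k1", "k3"}, vocab))   # 8 (1, 0, 1)
# The basis vector for minterm m_r is the r-th standard unit vector in a
# 2^t-dimensional space; per Equation (2.30), the term vector k_i sums the
# basis vectors of all minterms with a 1 in position i.
```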


Keyword Extraction by KNN considering Similarity among Features 64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,

More information

International ejournals

International ejournals Available online at www.internationalejournals.com International ejournals ISSN 0976 1411 International ejournal of Mathematics and Engineering 112 (2011) 1023-1029 ANALYZING THE REQUIREMENTS FOR TEXT

More information

Design and Implementation of Search Engine Using Vector Space Model for Personalized Search

Design and Implementation of Search Engine Using Vector Space Model for Personalized Search Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 1, January 2014,

More information

Multimedia Information Systems

Multimedia Information Systems Multimedia Information Systems Samson Cheung EE 639, Fall 2004 Lecture 6: Text Information Retrieval 1 Digital Video Library Meta-Data Meta-Data Similarity Similarity Search Search Analog Video Archive

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH

A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH A thesis Submitted to the faculty of the graduate school of the University of Minnesota by Vamshi Krishna Thotempudi In partial fulfillment of the requirements

More information

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Murhaf Fares & Stephan Oepen Language Technology Group (LTG) September 27, 2017 Today 2 Recap Evaluation of classifiers Unsupervised

More information

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

F. Aiolli - Sistemi Informativi 2006/2007

F. Aiolli - Sistemi Informativi 2006/2007 Text Categorization Text categorization (TC - aka text classification) is the task of buiding text classifiers, i.e. sofware systems that classify documents from a domain D into a given, fixed set C =

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

Encoding Words into String Vectors for Word Categorization

Encoding Words into String Vectors for Word Categorization Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

More information

Knowledge Discovery and Data Mining 1 (VO) ( )

Knowledge Discovery and Data Mining 1 (VO) ( ) Knowledge Discovery and Data Mining 1 (VO) (707.003) Data Matrices and Vector Space Model Denis Helic KTI, TU Graz Nov 6, 2014 Denis Helic (KTI, TU Graz) KDDM1 Nov 6, 2014 1 / 55 Big picture: KDDM Probability

More information

Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm

Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm K.Parimala, Assistant Professor, MCA Department, NMS.S.Vellaichamy Nadar College, Madurai, Dr.V.Palanisamy,

More information

Predictive Indexing for Fast Search

Predictive Indexing for Fast Search Predictive Indexing for Fast Search Sharad Goel, John Langford and Alex Strehl Yahoo! Research, New York Modern Massive Data Sets (MMDS) June 25, 2008 Goel, Langford & Strehl (Yahoo! Research) Predictive

More information

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! (301) 219-4649 james.mayfield@jhuapl.edu What is Information Retrieval? Evaluation

More information

Combining PGMs and Discriminative Models for Upper Body Pose Detection

Combining PGMs and Discriminative Models for Upper Body Pose Detection Combining PGMs and Discriminative Models for Upper Body Pose Detection Gedas Bertasius May 30, 2014 1 Introduction In this project, I utilized probabilistic graphical models together with discriminative

More information

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl

More information

Retrieval Evaluation. Hongning Wang

Retrieval Evaluation. Hongning Wang Retrieval Evaluation Hongning Wang CS@UVa What we have learned so far Indexed corpus Crawler Ranking procedure Research attention Doc Analyzer Doc Rep (Index) Query Rep Feedback (Query) Evaluation User

More information

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT Prof. Dipartimento di Elettronica e Informazione Politecnico di Milano INFORMATION SEARCH AND RETRIEVAL Inf. retrieval 1 PRESENTATION SCHEMA GOALS AND

More information

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University of Economics

More information

Authoritative K-Means for Clustering of Web Search Results

Authoritative K-Means for Clustering of Web Search Results Authoritative K-Means for Clustering of Web Search Results Gaojie He Master in Information Systems Submission date: June 2010 Supervisor: Kjetil Nørvåg, IDI Co-supervisor: Robert Neumayer, IDI Norwegian

More information

A Graph Theoretic Approach to Image Database Retrieval

A Graph Theoretic Approach to Image Database Retrieval A Graph Theoretic Approach to Image Database Retrieval Selim Aksoy and Robert M. Haralick Intelligent Systems Laboratory Department of Electrical Engineering University of Washington, Seattle, WA 98195-2500

More information

Tag-based Social Interest Discovery

Tag-based Social Interest Discovery Tag-based Social Interest Discovery Xin Li / Lei Guo / Yihong (Eric) Zhao Yahoo!Inc 2008 Presented by: Tuan Anh Le (aletuan@vub.ac.be) 1 Outline Introduction Data set collection & Pre-processing Architecture

More information

Instructor: Stefan Savev

Instructor: Stefan Savev LECTURE 2 What is indexing? Indexing is the process of extracting features (such as word counts) from the documents (in other words: preprocessing the documents). The process ends with putting the information

More information

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques 24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE

More information

EFFICIENT ATTACKS ON HOMOPHONIC SUBSTITUTION CIPHERS

EFFICIENT ATTACKS ON HOMOPHONIC SUBSTITUTION CIPHERS EFFICIENT ATTACKS ON HOMOPHONIC SUBSTITUTION CIPHERS A Project Report Presented to The faculty of the Department of Computer Science San Jose State University In Partial Fulfillment of the Requirements

More information

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY BHARAT SIGINAM IN

More information

VK Multimedia Information Systems

VK Multimedia Information Systems VK Multimedia Information Systems Mathias Lux, mlux@itec.uni-klu.ac.at This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Information Retrieval Basics: Agenda Vector

More information

Department of Electronic Engineering FINAL YEAR PROJECT REPORT

Department of Electronic Engineering FINAL YEAR PROJECT REPORT Department of Electronic Engineering FINAL YEAR PROJECT REPORT BEngCE-2007/08-HCS-HCS-03-BECE Natural Language Understanding for Query in Web Search 1 Student Name: Sit Wing Sum Student ID: Supervisor:

More information

Instantaneously trained neural networks with complex inputs

Instantaneously trained neural networks with complex inputs Louisiana State University LSU Digital Commons LSU Master's Theses Graduate School 2003 Instantaneously trained neural networks with complex inputs Pritam Rajagopal Louisiana State University and Agricultural

More information

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information

More information

Enhancing Clustering Results In Hierarchical Approach By Mvs Measures

Enhancing Clustering Results In Hierarchical Approach By Mvs Measures International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.25-30 Enhancing Clustering Results In Hierarchical Approach

More information

Duke University. Information Searching Models. Xianjue Huang. Math of the Universe. Hubert Bray

Duke University. Information Searching Models. Xianjue Huang. Math of the Universe. Hubert Bray Duke University Information Searching Models Xianjue Huang Math of the Universe Hubert Bray 24 July 2017 Introduction Information searching happens in our daily life, and even before the computers were

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Ranking Algorithms For Digital Forensic String Search Hits

Ranking Algorithms For Digital Forensic String Search Hits DIGITAL FORENSIC RESEARCH CONFERENCE Ranking Algorithms For Digital Forensic String Search Hits By Nicole Beebe and Lishu Liu Presented At The Digital Forensic Research Conference DFRWS 2014 USA Denver,

More information

60-538: Information Retrieval

60-538: Information Retrieval 60-538: Information Retrieval September 7, 2017 1 / 48 Outline 1 what is IR 2 3 2 / 48 Outline 1 what is IR 2 3 3 / 48 IR not long time ago 4 / 48 5 / 48 now IR is mostly about search engines there are

More information

21. Search Models and UIs for IR

21. Search Models and UIs for IR 21. Search Models and UIs for IR INFO 202-10 November 2008 Bob Glushko Plan for Today's Lecture The "Classical" Model of Search and the "Classical" UI for IR Web-based Search Best practices for UIs in

More information

Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

Vannevar Bush. Information Retrieval. Prophetic: Hypertext. Historic Vision 2/8/17

Vannevar Bush. Information Retrieval. Prophetic: Hypertext. Historic Vision 2/8/17 Information Retrieval Vannevar Bush Director of the Office of Scientific Research and Development (1941-1947) Vannevar Bush,1890-1974 End of WW2 - what next big challenge for scientists? 1 Historic Vision

More information

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. Knowledge Retrieval Franz J. Kurfess Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. 1 Acknowledgements This lecture series has been sponsored by the European

More information

Contents 1. INTRODUCTION... 3

Contents 1. INTRODUCTION... 3 Contents 1. INTRODUCTION... 3 2. WHAT IS INFORMATION RETRIEVAL?... 4 2.1 FIRST: A DEFINITION... 4 2.1 HISTORY... 4 2.3 THE RISE OF COMPUTER TECHNOLOGY... 4 2.4 DATA RETRIEVAL VERSUS INFORMATION RETRIEVAL...

More information

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University CS490W Text Clustering Luo Si Department of Computer Science Purdue University [Borrows slides from Chris Manning, Ray Mooney and Soumen Chakrabarti] Clustering Document clustering Motivations Document

More information

ResPubliQA 2010

ResPubliQA 2010 SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first

More information

String Vector based KNN for Text Categorization

String Vector based KNN for Text Categorization 458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research

More information

XI International PhD Workshop OWD 2009, October Fuzzy Sets as Metasets

XI International PhD Workshop OWD 2009, October Fuzzy Sets as Metasets XI International PhD Workshop OWD 2009, 17 20 October 2009 Fuzzy Sets as Metasets Bartłomiej Starosta, Polsko-Japońska WyŜsza Szkoła Technik Komputerowych (24.01.2008, prof. Witold Kosiński, Polsko-Japońska

More information

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,

More information

Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

More information

A Framework for adaptive focused web crawling and information retrieval using genetic algorithms

A Framework for adaptive focused web crawling and information retrieval using genetic algorithms A Framework for adaptive focused web crawling and information retrieval using genetic algorithms Kevin Sebastian Dept of Computer Science, BITS Pilani kevseb1993@gmail.com 1 Abstract The web is undeniably

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

Data Analytics and Boolean Algebras

Data Analytics and Boolean Algebras Data Analytics and Boolean Algebras Hans van Thiel November 28, 2012 c Muitovar 2012 KvK Amsterdam 34350608 Passeerdersstraat 76 1016 XZ Amsterdam The Netherlands T: + 31 20 6247137 E: hthiel@muitovar.com

More information

Semantic text features from small world graphs

Semantic text features from small world graphs Semantic text features from small world graphs Jurij Leskovec 1 and John Shawe-Taylor 2 1 Carnegie Mellon University, USA. Jozef Stefan Institute, Slovenia. jure@cs.cmu.edu 2 University of Southampton,UK

More information

Natural Language Processing

Natural Language Processing Natural Language Processing Information Retrieval Potsdam, 14 June 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book Outline 2 1 Introduction 2 Indexing Block Document

More information

Modelling Structures in Data Mining Techniques

Modelling Structures in Data Mining Techniques Modelling Structures in Data Mining Techniques Ananth Y N 1, Narahari.N.S 2 Associate Professor, Dept of Computer Science, School of Graduate Studies- JainUniversity- J.C.Road, Bangalore, INDIA 1 Professor

More information

Information Retrieval and Data Mining Part 1 Information Retrieval

Information Retrieval and Data Mining Part 1 Information Retrieval Information Retrieval and Data Mining Part 1 Information Retrieval 2005/6, Karl Aberer, EPFL-IC, Laboratoire de systèmes d'informations répartis Information Retrieval - 1 1 Today's Question 1. Information

More information

Semantic Search in s

Semantic Search in  s Semantic Search in Emails Navneet Kapur, Mustafa Safdari, Rahul Sharma December 10, 2010 Abstract Web search technology is abound with techniques to tap into the semantics of information. For email search,

More information

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues COS 597A: Principles of Database and Information Systems Information Retrieval Traditional database system Large integrated collection of data Uniform access/modifcation mechanisms Model of data organization

More information

A Content Vector Model for Text Classification

A Content Vector Model for Text Classification A Content Vector Model for Text Classification Eric Jiang Abstract As a popular rank-reduced vector space approach, Latent Semantic Indexing (LSI) has been used in information retrieval and other applications.

More information

Chapter 3 - Text. Management and Retrieval

Chapter 3 - Text. Management and Retrieval Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter 3 - Text Management and Retrieval Literature: Baeza-Yates, R.;

More information

Ranking models in Information Retrieval: A Survey

Ranking models in Information Retrieval: A Survey Ranking models in Information Retrieval: A Survey R.Suganya Devi Research Scholar Department of Computer Science and Engineering College of Engineering, Guindy, Chennai, Tamilnadu, India Dr D Manjula Professor

More information

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Advanced Research in Computer Science and Software Engineering Volume 3, Issue 3, March 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue:

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

More information

Making Retrieval Faster Through Document Clustering

Making Retrieval Faster Through Document Clustering R E S E A R C H R E P O R T I D I A P Making Retrieval Faster Through Document Clustering David Grangier 1 Alessandro Vinciarelli 2 IDIAP RR 04-02 January 23, 2004 D a l l e M o l l e I n s t i t u t e

More information

INFORMATION RETRIEVAL SYSTEM: CONCEPT AND SCOPE

INFORMATION RETRIEVAL SYSTEM: CONCEPT AND SCOPE 15 : CONCEPT AND SCOPE 15.1 INTRODUCTION Information is communicated or received knowledge concerning a particular fact or circumstance. Retrieval refers to searching through stored information to find

More information

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction CHAPTER 5 SUMMARY AND CONCLUSION Chapter 1: Introduction Data mining is used to extract the hidden, potential, useful and valuable information from very large amount of data. Data mining tools can handle

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Structural and Syntactic Pattern Recognition

Structural and Syntactic Pattern Recognition Structural and Syntactic Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2017 CS 551, Fall 2017 c 2017, Selim Aksoy (Bilkent

More information

Information Retrieval

Information Retrieval s Information Retrieval Information system management system Model Processing of queries/updates Queries Answer Access to stored data Patrick Lambrix Department of Computer and Information Science Linköpings

More information