ABSTRACT

VENKATESH, JAYASHREE. Pairwise Document Similarity using an Incremental Approach to TF-IDF. (Under the direction of Dr. Christopher Healey.)

Advances in information and communication technologies offer ubiquitous access to vast amounts of information and are causing an exponential increase in the number of documents available online. While this can be highly beneficial for both humans and automated computer systems, it also means that we must now support real-time operations over large, dynamic sets of documents. While several algorithms for the organization and retrieval of textual information exist, these algorithms require that the document set be available before analysis can begin. TF-IDF is one such algorithm: any change in the document set requires recalculation of the term weights assigned to all terms. We present a methodology to reduce the number of term-weight computations in an environment where documents are continuously being added or removed. We evaluate the approach by measuring its efficiency and accuracy on one of its most important applications, the computation of pairwise document similarity.

© Copyright 2010 by Jayashree Venkatesh. All Rights Reserved.

Pairwise Document Similarity using an Incremental Approach to TF-IDF

by
Jayashree Venkatesh

A thesis submitted to the Graduate Faculty of North Carolina State University in partial fulfillment of the requirements for the Degree of Master of Science

Computer Science

Raleigh, North Carolina
2010

APPROVED BY:

Dr. Robert St. Amant          Dr. Jon Doyle

Dr. Christopher Healey
Chair of Advisory Committee

DEDICATION

To my parents.

BIOGRAPHY

Jayashree Venkatesh was born on August 6th, 1984 in Bangalore, India. She received her Bachelor's degree in Computer Science from Visweswaraiah Technological University in June. She later worked at the Yahoo! Software Development Center, Bangalore, India, mainly on Web 2.0 technology. Since 2008, she has been a Master's student in the Department of Computer Science at North Carolina State University. During the summer and fall of 2009 she worked at Intel Inc., Folsom, California, as a Software Engineering Intern. After completing her Master's program requirements she will join Intel Inc. as a Graphics Software Engineer.

ACKNOWLEDGEMENTS

I would like to express my utmost gratitude to my advisor Dr. Christopher Healey for giving me the opportunity to work under him on my Master's thesis. His insight, guidance, and patience have been most valuable, and I am grateful for the time he spent working with me on the drafts of this report. I would also like to thank him for providing me the opportunity to intern at Intel. My gratitude goes to Dr. Robert St. Amant and Dr. Jon Doyle for serving on my thesis committee and for being exceptionally accommodating. Additionally, I would like to thank Ping-Lin Hsiao, Srinath Setty, and Mohit Khanna for proofreading this document and providing valuable suggestions. Lastly, I would like to thank my roommates Preeti Syal and Krishna Bala Subramani for their constant support and encouragement.

TABLE OF CONTENTS

List of Tables
List of Figures

Chapter 1 Introduction
  1.1 Contributions
  1.2 Thesis Structure

Chapter 2 A Survey of Information Retrieval Systems
  2.1 Basic Algorithmic Operations
    2.1.1 Document Representation Techniques
    2.1.2 Information Retrieval Models
  2.2 Classic Information Retrieval Models
    2.2.1 Boolean (exact-match) Model
    2.2.2 Vector Model
    2.2.3 Probabilistic Model
    2.2.4 Comparison of the Classic Models
  2.3 Alternative Set Theoretic Models
    2.3.1 Extended Boolean Model
    2.3.2 Fuzzy Set Model
  2.4 Alternate Algebraic Models
    2.4.1 Generalized Vector Space Model
    2.4.2 Latent Semantic Indexing Model
    2.4.3 Neural Network Model
  2.5 Alternative Probabilistic Models
    2.5.1 Inference Network Model
  2.6 Structured Document Retrieval Models
    2.6.1 Non-Overlapping Lists Model
    2.6.2 Proximal Nodes Model

Chapter 3 Document Similarity using Incremental TF-IDF
  3.1 Term-Weighting, Vector Space Model and Cosine Similarity
    3.1.1 Choice of Content Terms
    3.1.2 Term Weight Specifications
  3.2 Document Preprocessing
    3.2.1 Lexical Analysis of the Text
    3.2.2 Removal of Stop Words or High Frequency Words
    3.2.3 Suffix Stripping by Stemming
  3.3 Incremental Approach for TF-IDF Recalculation
    3.3.1 Term Weights and TF-IDF Data Structures
    3.3.2 Document Similarity Computation
    3.3.3 Incremental TF-IDF

Chapter 4 Results
  4.1 Handling Dynamic Document Collection
    4.1.1 Addition or Removal of Documents
    4.1.2 Calculating Document Similarity Difference
  4.2 Experimental Results
    4.2.1 Experiment Set 1
    4.2.2 Experiment Set 2
    4.2.3 Experiment Set 3
    4.2.4 Experiment Set 4
  4.3 Conclusion

Chapter 5 Summary and Future Work
  5.1 Challenges and Future Directions

References

LIST OF TABLES

Table 3.1 Stop Words List
Table 4.1 Experimental Results Summary

LIST OF FIGURES

Figure 2.1 Extended Boolean Model: the map at the left shows the similarities of q_or = (k_x ∨ k_y) with documents d_j and d_{j+1}; the map at the right shows the similarities of q_and = (k_x ∧ k_y) with documents d_j and d_{j+1} [6]
Figure 2.2 Neural Network Model [31]
Figure 2.3 Inference Network Model [15]
Figure 2.4 Non-Overlapping Lists Model [6]
Figure 2.5 Proximal Nodes Model [32]
Figure 4.1 Graph of count ratio vs. average and maximum similarity difference for different document count difference thresholds
Figure 4.2 Graph of count ratio vs. average and maximum similarity difference for different percentage error thresholds
Figure 4.3 (1) Count ratio vs. % error threshold; (2) average similarity difference vs. % error threshold; (3) maximum similarity difference vs. % error threshold
Figure 4.4 Graph of count ratio vs. average and maximum similarity difference for different percentage error thresholds
Figure 4.5 (1) Count ratio vs. % error threshold; (2) average similarity difference vs. % error threshold; (3) maximum similarity difference vs. % error threshold
Figure 4.6 Graph of count ratio vs. average and maximum similarity difference for T_2 = 0%

Chapter 1

Introduction

The abundance of information (in digital form) available in online repositories can be highly beneficial for both humans and automated computer systems that seek information. It also poses extremely difficult challenges, however, due to the variety and amount of data available. One of the most challenging analysis problems in the data mining and information retrieval domains is organizing large amounts of information [1]. While several retrieval models have been proposed as the basis for the organization and retrieval of textual information, the Vector Space Model (VSM) has shown the most value due to its efficiency. VSM nevertheless has substantial computational complexity for large sets of documents, and every time the document collection changes, the basic algorithm requires a costly re-evaluation of the entire document set. While in some problem domains it is possible to know the document collection a priori, this is not feasible in many real-time applications with large, dynamic sets of documents.

VSM works by assigning term weights to the terms of the documents. Term weights allow computing a continuous degree of similarity between documents. The best known term weighting scheme uses the term frequency and inverse document frequency of a term and is called the TF-IDF weighting scheme. Term frequency (TF) determines the importance of a term within the document in which it is found.

The more frequently the term appears in the document, the higher the value of its term frequency [2]. Zipf [3] pointed out that a term which appears in many documents in the collection is not useful for distinguishing a relevant document from a non-relevant one. To take this into account, the inverse document frequency (IDF) factor was introduced; it measures the inverse of the relative number of documents that contain the term. A number of variations of TF-IDF exist today, but the underlying principle remains the same [4].

VSM assigns a term weight to every term in a document. Thus a document can be represented as a vector composed of its term weights. Since there are several documents in the collection, the dimensionality of the vector space is equal to the total number of unique terms in the document collection. Pairwise document similarity computation is an important application of VSM that is based on the dot product of document vectors. On each addition or removal of a document, the document vectors change. Specifically:

1. For terms in the document being added or removed, their IDF will change because of changes in both the number of documents containing the term and the total number of documents.

2. For terms not in the document being added or removed, their IDF will also change, because the total number of documents in the collection changes.

This means that with any change in the document set, all the term weights associated with all the documents need to be recalculated. To avoid recalculating all the document vectors on each addition or removal of documents, we propose a mechanism that recalculates only a subset of the term weights, and only when we estimate that the similarity error for deferring the calculation exceeds a predetermined threshold. A sketch of the bookkeeping involved appears below.
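To make this concrete, the following minimal sketch (our illustration, not the thesis implementation; all names are ours) tracks the two quantities a TF-IDF update depends on. Adding a document changes the document frequency df only for that document's terms, while the collection size N changes for every term; the incremental approach of this thesis defers the N-driven recomputation until the estimated similarity error crosses a threshold.

```python
# Minimal sketch (ours) of which TF-IDF values are affected when a
# document is added to a collection.
import math
from collections import Counter

class Corpus:
    def __init__(self):
        self.docs = {}          # doc_id -> Counter of term frequencies
        self.df = Counter()     # term -> number of documents containing it

    def add(self, doc_id, terms):
        tf = Counter(terms)
        self.docs[doc_id] = tf
        for term in tf:
            self.df[term] += 1
        # Exact recomputation would touch every term, because len(self.docs)
        # changed; only these terms had their df change as well.
        return set(tf)

    def idf(self, term):
        return math.log(len(self.docs) / self.df[term])

    def weight(self, doc_id, term):
        tf = self.docs[doc_id]
        return (tf[term] / sum(tf.values())) * self.idf(term)

c = Corpus()
changed = c.add("d1", ["apple", "banana", "apple"])
c.add("d2", ["banana", "cherry"])
print(changed, round(c.weight("d1", "apple"), 3))
```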

1.1 Contributions

The aim of this thesis is to reduce the number of computations made on a real-time, dynamic document collection as documents are added or removed. To achieve this we first develop a working model of the TF-IDF algorithm. Second, we make provisions for dynamic document addition or removal. We then modify the TF-IDF algorithm to recalculate the term weights of only those terms whose document frequency changes on any change in the document set. We describe this as an incremental approach because, unlike the traditional approach which recalculates all the TF-IDF values, recalculation is done only for a smaller set of terms. Following this we conducted a series of experiments to determine how much reduction in the number of computations we achieved. For evaluation we use 1477 articles from various newspaper collections. We conducted our experiments on one of the most important applications of the VSM, pairwise document similarity calculation. We perform several steps of document addition, removal, or addition and removal. After each step we record the number of computations and the pairwise document similarity values as calculated by the incremental approach. We then run through the same set of steps with the traditional approach and once again record the number of computations made, along with the document similarity values. We compare these values to evaluate the performance of the incremental approach against the traditional approach.

1.2 Thesis Structure

In Chapter 2 we give a detailed background on existing document similarity techniques. In Chapter 3 we discuss the design and implementation of the incremental TF-IDF approach. Chapter 4 presents the experimental results comparing the incremental approach to the traditional approach. Finally, Chapter 5 concludes the thesis and highlights directions for future work.

Chapter 2

A Survey of Information Retrieval Systems

The phenomenal growth in the variety and quantity of information available to users has resulted from advances in electronic and computer technology. As a result, users are often faced with the problem of reducing the amount of information to a manageable size, so that only relevant items need be examined. In Alvin Toffler's book Future Shock [5], Emilio Segre, Nobel prize winning physicist, is quoted as saying that on k-mesons alone, to wade through all the papers is an impossibility. This indicates that even in specialized, narrow topics, information is growing enormously. Thus there is a great demand for efficient and effective ways to organize and search through all this information.

Information Retrieval (IR) is concerned with identifying documents in a collection that best match the user's information needs. IR deals with three important concepts: representation of information content, acquisition and representation of the information to be found, and matching functions that retrieve the relevant documents from the information collection. In their book on Modern Information Retrieval [6], Yates and Neto describe IR as a means to represent, organize, store, and access information items.

In their book on Text Information Retrieval Systems [7], Meadow, Boyce and Kraft compare IR to a communication process: a means by which authors or creators of records communicate with readers, indirectly and with a possible time lag between the creation of a text and its delivery to the IR system user.

IR can be subdivided into three main areas of research [8] which make up a considerable portion of the subject: content analysis, information structures, and evaluation. Content analysis is concerned with describing the contents of documents in a form suitable for computer processing. Luhn [2] used frequency counts of words to determine which words in a document should be used to describe the document. Sparck Jones [9] used association of keywords (groups of related words) to derive frequency co-occurrence, which describes the frequency with which words occur together in a document. Several other content analysis techniques are in use today.

Information structures deal with document representations. Most computer-based IR systems store only a representation of a document, by which we mean a list of keywords or terms extracted from the document which are considered important. A retrieval function is executed on this document representation to retrieve the required information; the output is a set of citations or document representations. In some IR systems the user can change his request after a sample retrieval, hoping to improve the subsequent retrieval run. This procedure is commonly referred to as relevance feedback, in which the user can indicate which documents are relevant and which are non-relevant. Organization of files is produced by an automatic classification method; Good [10] and Fairthorne [11] were the pioneers of this automatic representation/classification of information.

Evaluation of IR systems [8] has proved to be extremely difficult. Despite a large amount of work in this area, a general theory of evaluation has not emerged. Lesk and Salton [12] described a dichotomous scale which evaluates IR systems on recall (the portion of relevant documents retrieved) and precision (the portion of retrieved documents which are relevant). Today, evaluation of IR systems is still done using recall and precision as the scale.

2.1 Basic Algorithmic Operations

Algorithmic issues arise in two aspects of IR systems: (1) representing objects (text/image/multimedia) in a form amenable to automated search, and (2) efficiently searching such representations [13]. First we shall focus on the various representations used for documents and information needs. We will then discuss the classic retrieval models for information retrieval.

2.1.1 Document Representation Techniques

Three classic ideas pervade information retrieval systems for efficient document representation: indexing, negative dictionaries (also known as stop word lists), and stemming [13]. A collection of documents is commonly referred to as a corpus.

Indexing deals with storing the subsets of documents associated with different terms in the corpus. A simple query returns all documents which contain any of the query terms. However, this approach leads to poor precision, since a user generally requires a Boolean AND of the search terms, not a Boolean OR. To solve this issue we could retrieve, for each query term, the documents which contain it, and take the intersection of these sets of documents. This approach, however, processes many more documents than are returned as output. Hence it is desirable for an efficient IR system to return a list of documents ranked according to some scheme based on the number of query terms each document contains. This, however, falls within the scope of retrieval models and will be discussed in the next section. A better indexing mechanism is to store the position of each occurrence of a term in a document, along with the documents which contain the term. This is helpful for queries dealing with a particular occurrence of the query terms. Thus the indexing algorithm should be capable of supporting complex queries such as string queries. This suggests that it is necessary for the algorithm to understand the underlying corpus. Techniques that exploit the term statistics in the corpus were thus designed [13], as in the sketch below.
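As a concrete illustration of indexing with positions (our sketch, not from the thesis; all names are ours), the following positional inverted index supports both Boolean AND queries by posting-list intersection and, because positions are stored, could be extended to phrase or proximity queries.

```python
# Minimal positional inverted index sketch (illustrative only).
from collections import defaultdict

index = defaultdict(list)   # term -> list of (doc_id, position)

def add_document(doc_id, text):
    for pos, term in enumerate(text.lower().split()):
        index[term].append((doc_id, pos))

def boolean_and(*terms):
    # Documents containing every query term: intersect the posting lists.
    postings = [{doc for doc, _ in index[t]} for t in terms]
    return set.intersection(*postings) if postings else set()

add_document(1, "information retrieval systems")
add_document(2, "retrieval of textual information")
print(boolean_and("information", "retrieval"))   # {1, 2}
```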

The first step in this direction was the use of negative dictionaries (or stop word lists). A stop word list is a list of words that occur so commonly in the corpus that using them as index terms is not a good idea: a query containing a term from the stop word list would fetch almost all documents in the corpus. Prepositions and articles are commonly included in negative dictionaries. This technique has certain trade-offs, however, because it becomes difficult to search for strings that contain only prepositions and articles. Also, the contents of a negative dictionary should be designed with the corpus in mind: for instance, the word can is generally considered a stop word, but in a corpus on waste management and recycling it might be an important index term.

Another important technique to reduce the number of index terms is the use of a stemming algorithm. This approach reduces search and index terms to their etymological roots. For example, a search for educational could return all documents containing the term education. A stop-word filter and a crude suffix stripper are sketched below.
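The following sketch (ours, with an illustrative stop word list and suffix table; production systems would use something like the Porter stemmer) combines the two preprocessing steps just described.

```python
# Illustrative stop-word removal and naive suffix stripping.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "is"}
SUFFIXES = ("ational", "tional", "ing", "al", "s")   # checked longest first

def stem(term):
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[:-len(suffix)]
    return term

def preprocess(text):
    terms = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [stem(t) for t in terms]

print(preprocess("The educational value of stemming"))
# ['educ', 'value', 'stemm'] -- crude, but maps related word forms together
```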

Another approach is to determine the lexical relationship between terms in the query and the documents in which they occur [14]. If two terms appear adjacent to each other in a query, and some documents contain these two terms close to each other (say within a distance of five to eight words), then the IR system could rank these documents higher than the others. For instance, a search for the term computer networks could first return those documents which contain these two terms close to each other.

Precision in text retrieval can be improved by using an approach called categorized search. In this approach each document is assigned to one or more categories. For instance, tags such as Arts, Business and Economy, Government, Education, and Science and Technology could be assigned to documents, indicating the major category to which they belong. Each of these categories is further subdivided into multiple sub-categories. After assigning a document to its broader category and sub-categories, the very terms that indicated the document as belonging to a category or sub-category can now function as stop words. For example, if the term painting was used to categorize a document as belonging to the category Art, then after categorization the term paint/painting can be removed from the entire document as a stop word. The choice of categories to improve the relevance of a document is an important task. Raghavan [13] discusses three important steps to consider before categorizing a corpus. First, the choice of categories should be intuitive to the anticipated user population. Second, the choice of categories should lead to a balanced taxonomy: a small number of categories containing all the documents is not recommended. Third, the choice of categories should span the corpus. Choosing two categories that are very similar to each other is not desirable and could confuse the user. It is also important to realize that categorization of documents into clusters is not a static process. In a corpus containing, say, news articles, the clusters may change as the focus of the news changes. Thus categorization should be dynamic and capable of supporting rapid change in the documents contained in each cluster.

Finally, we return to the idea of relevance feedback discussed at the beginning of this chapter. The IR system returns as output a set of documents based on the user's query. The user then marks each of these documents as relevant or irrelevant. On receiving this input, the system refines the query to fetch results closer to what the user is seeking. This process leads to an increase in precision and recall.

2.1.2 Information Retrieval Models

While information can be of several types (text, multimedia, images, etc.), most of the information sought by the end user is in textual form. Several retrieval models have been proposed as the basis for text retrieval systems. Three classic models in IR, however, are used widely: the exact-match (Boolean) model, the vector space model, and the probabilistic model. Experiments have shown that the differences in these approaches can be explained as differences in the estimation of probabilities, both in the initial search and during relevance feedback [15].

The exact-match or Boolean model views documents and queries as sets of index terms; thus, as suggested in [16], this retrieval model is called set theoretic.

The vector space model views documents and queries as vectors in a high-dimensional vector space, and uses distance as a measure of similarity; thus this model is said to be algebraic. In the probabilistic model, retrieval is viewed as the problem of estimating the probability that a document representation matches or satisfies a query; as the name indicates, this model is said to be probabilistic.

Several alternate modeling paradigms based on the classic models have been proposed in recent years. Fuzzy and extended Boolean models have been developed as alternatives to the basic Boolean approach; generalized vector, latent semantic indexing, and neural network models as alternatives to the algebraic models; and inference networks and belief networks as alternatives to the probabilistic models [6]. While all these models deal with the text content found in documents, several approaches dealing with the structure of the written text have also been proposed. These models are termed structured models, and two popular models in this category are the non-overlapping lists model and the proximal nodes model.

Yates and Neto [6] also discuss models for browsing in their book on Modern Information Retrieval. They suggest that users of an information system might be engaged in one of two tasks: retrieval or browsing. The task of translating the user's information need into a query in the language provided by the system is known as retrieval; all the models discussed earlier in this section are models for retrieval. Suppose instead that the user has an interest which is poorly defined or inherently broad; for example, the user is interested in documents about the solar system. In this case the user interacts with the IR system, looking around the documents on the solar system. While looking through these documents the user may find interesting documents about galaxies, black holes, asteroids in the solar system, or even the planet Earth. Furthermore, while reading about the planet Earth he may wander off into documents on the greenhouse effect, pollution on the Earth, and even documents pertaining to pollution control organizations. The user is then said to be browsing the collection of documents. While this is also a process of information retrieval, the objectives of the task are not clearly defined at the beginning, and the purpose of the task may keep changing during the interaction with the system. Several models for browsing also exist; they are not discussed in this chapter.

2.2 Classic Information Retrieval Models

2.2.1 Boolean (exact-match) Model

The Boolean model is a simple retrieval model based on set theory and Boolean algebra. In the Boolean model, a set of binary-valued variables refers to the features that can be assigned to documents. A document is an assignment of truth values to this set of feature variables: features which are correct descriptions of the document content are assigned true, and all other features are assigned false. As a result, the weights assigned to the feature variables are all binary, i.e., the weight w_{i,j} associated with the pair (k_i, d_j) is either 0 or 1. Here k_i is the i-th feature variable and d_j corresponds to the j-th document. In this model queries are specified as Boolean expressions involving the operators and, or, and not. Any document whose truth value assignment matches the query expression is said to match the query; all others fail to match. The Boolean model predicts that each document is either relevant or irrelevant; there is no notion of a partial match to the query conditions. A minimal sketch of this matching appears below.

The main advantages of the Boolean model are the clean formalism behind the model and its simplicity. The major drawback of this model is that its retrieval strategy is based on a binary decision criterion: a document is predicted to be either relevant or non-relevant. This notion of exact matching may lead to retrieval of too many or too few documents. Another drawback is that it is not always simple to translate an information need into a Boolean expression. Despite these drawbacks, the Boolean model is still the dominant model in commercial document database systems and provides a good starting point for those new to the field.
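As an illustration (our sketch; documents and query are invented), exact-match retrieval reduces to evaluating a Boolean expression over each document's feature set. Every document is either in or out; there is no ranking.

```python
# Illustrative Boolean (exact-match) retrieval: each document is the
# set of features assigned "true".
docs = {
    "d1": {"information", "retrieval", "boolean"},
    "d2": {"information", "vector", "model"},
}

# Query: information AND (boolean OR vector) AND NOT probabilistic
def matches(features):
    return ("information" in features
            and ("boolean" in features or "vector" in features)
            and "probabilistic" not in features)

relevant = [d for d, features in docs.items() if matches(features)]
print(relevant)   # ['d1', 'd2'] -- a binary decision, no partial match
```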

2.2.2 Vector Model

The vector model [16, 17] recognizes that the binary weighting scheme is too limiting, and hence provides a framework in which partial matching is possible. The vector model works by assigning non-binary weights to the index terms in queries and in documents. These index term weights are then used to compute the degree of similarity between the documents in the corpus, and also between the documents and the query. The similarity values are used to sort the retrieved documents in decreasing order of their degree of similarity. Thus the vector model takes into consideration documents which match the query terms only partially, and the documents retrieved as output match the user's query more precisely than the documents retrieved by the Boolean model.

In the vector model, both the j-th document d_j and the query q are represented as multidimensional vectors. The vector model evaluates the similarity between the document d_j and the query q. The correlation between the two vectors is quantified as the cosine of the angle between them. Thus the similarity between the document vector and the query is given as:

$$\mathrm{sim}(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}|\,|\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2}\;\sqrt{\sum_{i=1}^{t} w_{i,q}^2}} \qquad (2.1)$$

where |d_j| and |q| are the lengths of the vectors and t is the number of dimensions. While the query remains the same, the document space varies, and hence normalization is essential. w_{i,j} is the weight associated with the pair (k_i, d_j), as in the Boolean model, except that it is now a positive non-binary value. The index terms in the query are also weighted: w_{i,q} is the positive weight associated with the pair (k_i, q), where k_i is the i-th index term in the query q.

Thus in this model both the documents and the query are represented as vectors with t index terms. The document vector is represented by d_j = (w_{1,j}, w_{2,j}, ..., w_{t,j}) and the query vector by q = (w_{1,q}, w_{2,q}, ..., w_{t,q}). sim(d_j, q) varies from 0 to +1, since both w_{i,j} and w_{i,q} are positive. The vector model ranks the documents according to their degree of similarity to the query instead of predicting whether the documents are relevant or not. Thus a document is retrieved even if it only partially matches a query. To prevent a large number of documents from being retrieved, the user can specify a similarity threshold; only those documents with a similarity value above the threshold will be retrieved.

Index term weights for the vector space model can be calculated in many different ways. However, the most effective term weighting approaches use a clustering technique, as described by Salton, Wong and Yang [18]. Thus the IR problem can be viewed as one of clustering. Given a collection of objects C and a user query specifying a set A of objects, the clustering approach deals with classifying the objects in the collection C as belonging to the set A or not. The clustering approach deals with two issues: first, identifying what features better describe the objects in the set A; second, identifying what features distinguish the objects in the set A from those not in the set A. Thus the two issues are to determine the intra-cluster similarity and the inter-cluster dissimilarity [6].

According to Salton and McGill [19], intra-cluster similarity can be measured by the frequency of a term k_i in the document d_j. This is referred to as the term frequency (tf) and is a measure of how well the term describes the document's contents. For inter-cluster dissimilarity, Salton and McGill [19] specify an inverse document frequency measure (idf). The idf factor reflects that terms which appear in many documents are not very useful for distinguishing a relevant document from a non-relevant one. An effective term weighting scheme tries to balance both these effects.

Let D be the total number of documents in the collection, and let |{d : k_i ∈ d}| be the number of documents in which the term k_i appears. Let freq_{i,j} be the frequency of occurrence of the term k_i in the document d_j (i.e., the number of times the term k_i is mentioned in the document d_j).

Then the normalized term frequency for the term k_i in the document d_j is given by

$$tf_{i,j} = \frac{freq_{i,j}}{\sum_{p=1}^{t} freq_{p,j}} \qquad (2.2)$$

where the denominator is the sum of the number of occurrences of all the terms in the document d_j. If the term k_i does not appear in the document d_j, then tf_{i,j} = 0. The inverse document frequency for the term k_i is given by

$$idf_i = \log \frac{D}{|\{d : k_i \in d\}|} \qquad (2.3)$$

The best known term weighting schemes use weights which combine both the term frequency and the inverse document frequency:

$$w_{i,j} = tf_{i,j} \times idf_i \qquad (2.4)$$

This term weighting scheme is called the TF-IDF weighting scheme. Salton and Buckley [20] suggest several variations of the weight w_{i,j} using the same underlying TF-IDF principle; however, the above scheme is effective for most document collections.

The vector space model has the following advantages: (1) improved performance through an efficient term weighting scheme; (2) retrieval of documents which partially match a query instead of requiring a total match; (3) cosine similarity sorts documents according to their degree of similarity with the query, allowing users to choose documents above some threshold level of similarity. The main disadvantage of the vector space model is that it assumes the terms are all mutually independent; the lexical/term relationship between documents is not considered. However, the benefit of determining lexical relationships among terms is very collection specific, and hence might not influence the algorithm's performance in all scenarios. The vector space model is thus a simple, efficient, and highly popular retrieval model. The sketch below assembles Equations (2.1) through (2.4) into a small similarity function.
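The following compact sketch (ours; function names and sample data are illustrative) implements Equations (2.2) through (2.4) and the cosine similarity of Equation (2.1), applied here to a pair of documents rather than a document and a query.

```python
# TF-IDF weights (Eqs. 2.2-2.4) and cosine similarity (Eq. 2.1).
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc."""
    D = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        freq = Counter(doc)
        total = sum(freq.values())          # normalizing denominator of Eq. 2.2
        vectors.append({t: (f / total) * math.log(D / df[t])
                        for t, f in freq.items()})
    return vectors

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

docs = [["apple", "banana", "apple"], ["banana", "cherry"], ["apple", "cherry"]]
vecs = tfidf_vectors(docs)
print(round(cosine(vecs[0], vecs[2]), 3))   # pairwise document similarity
```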

2.2.3 Probabilistic Model

The probabilistic model, introduced by Robertson and Sparck Jones [21], is also known as the binary independence retrieval (BIR) model. The fundamental idea of this model is as follows. For a given user query, there is a set of documents which contains exactly the relevant documents; this set is known as the ideal answer set. If the description of this ideal answer set were known, there would be no problem retrieving the relevant documents. Generally, however, the properties of this ideal answer set are not known exactly. Since the properties are unknown, an initial guess is made at query time as to what these properties could be. Through this probabilistic description of the ideal answer set, a first set of documents is retrieved. The user then, through relevance feedback, looks at the retrieved documents and decides which are relevant and which are not. This helps further improve the probabilistic description of the ideal answer set.

Given a query q and a document d_j, the probabilistic model tries to estimate the probability that the user will find the document d_j relevant. The model assumes that there is a subset of documents which the user prefers as the answer set to the query q; call this ideal answer set R. The probabilistic model assigns to each document d_j, as its similarity to the query q, the ratio P(d_j relevant to q) / P(d_j non-relevant to q). This computes the odds that the document d_j is relevant to the query q. As in the Boolean model, the index term weight variables for the probabilistic model are binary, i.e., w_{i,j} ∈ {0, 1} and w_{i,q} ∈ {0, 1}. The query q is a subset of the index terms. R is the set of relevant documents and R̄ the set of non-relevant documents. Let P(R | d_j) be the probability that the document d_j is relevant to the query q, and let P(R̄ | d_j) be the probability that the document d_j is non-relevant to the query q. The similarity measure of the document d_j with the query q is defined as

$$\mathrm{sim}(d_j, q) = \frac{P(R \mid \vec{d_j})}{P(\bar{R} \mid \vec{d_j})} \qquad (2.5)$$

Using Bayes' rule,

$$\mathrm{sim}(d_j, q) = \frac{P(\vec{d_j} \mid R)\, P(R)}{P(\vec{d_j} \mid \bar{R})\, P(\bar{R})} \qquad (2.6)$$

where P(d_j | R) stands for the probability of randomly selecting the document d_j from the set of relevant documents R, and P(R) is the probability that a document randomly selected from the entire collection is relevant. The meanings of P(d_j | R̄) and P(R̄) are analogous and complementary. Since P(R) and P(R̄) are the same for all documents in the collection, we can write

$$\mathrm{sim}(d_j, q) \sim \frac{P(\vec{d_j} \mid R)}{P(\vec{d_j} \mid \bar{R})} \qquad (2.7)$$

If we assume the independence of index terms (i.e., index terms are not related to each other), then we can rewrite Equation (2.7) as

$$\mathrm{sim}(d_j, q) \sim \frac{\left(\prod_{g_i(\vec{d_j})=1} P(k_i \mid R)\right)\left(\prod_{g_i(\vec{d_j})=0} P(\bar{k_i} \mid R)\right)}{\left(\prod_{g_i(\vec{d_j})=1} P(k_i \mid \bar{R})\right)\left(\prod_{g_i(\vec{d_j})=0} P(\bar{k_i} \mid \bar{R})\right)} \qquad (2.8)$$

where P(k_i | R) stands for the probability that the index term k_i is present in a document randomly selected from the set R, and P(k̄_i | R) is the probability that the index term k_i is not present in a document randomly selected from the set R. The meanings of P(k_i | R̄) and P(k̄_i | R̄) are analogous and complementary.

Initially we do not know the set R of relevant documents, so it is necessary to estimate the probabilities P(k_i | R) and P(k_i | R̄). We make the following assumptions: (1) P(k_i | R) is constant for all index terms (equal to 0.5); (2) the distribution of the index terms among the non-relevant documents can be approximated by the distribution of index terms among all the documents in the collection. Thus we have:

$$P(k_i \mid R) = 0.5 \qquad (2.9)$$

$$P(k_i \mid \bar{R}) = \frac{n_i}{N} \qquad (2.10)$$

where n_i is the number of documents which contain the index term k_i and N is the total number of documents in the collection. Using Equations (2.9) and (2.10) we can retrieve an initial set of documents containing the query terms and provide an initial probabilistic ranking for them. From here we improve the ranking as follows. Let V be the number of documents initially retrieved and ranked by the probabilistic model, and let V_i be the subset of V containing the index term k_i. To improve the probabilistic ranking we need better estimates of P(k_i | R) and P(k_i | R̄). We now make the following assumptions: (1) P(k_i | R) can be approximated by the distribution of the index term k_i among the documents retrieved so far; (2) P(k_i | R̄) can be approximated by considering all non-retrieved documents to be non-relevant. We can then write

$$P(k_i \mid R) = \frac{V_i}{V} \qquad (2.11)$$

$$P(k_i \mid \bar{R}) = \frac{n_i - V_i}{N - V} \qquad (2.12)$$

This process can be repeated recursively, improving the estimates of P(k_i | R) and P(k_i | R̄); a sketch of one such iteration appears below. The main advantage of the probabilistic model is that documents are ranked in decreasing order of their probability of being relevant. The disadvantages of the approach are: (1) the initial step of guessing the separation of documents into relevant and non-relevant sets; (2) the model does not take into account either the term frequency of an index term within a document or its inverse document frequency within the collection; (3) the model assumes the index terms are independent, with no relationships between them.
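The sketch below (ours, not the thesis code) ranks documents by the product of Equation (2.8), using the initial estimates of Equations (2.9) and (2.10) on the first pass and the feedback estimates of Equations (2.11) and (2.12) afterwards. Two liberties are ours, not the text's: the product is restricted to query terms (assuming non-query terms contribute equal factors above and below), and a +0.5 smoothing is added to avoid division by zero on small collections.

```python
# Illustrative iteration of the BIR model, Equations (2.8)-(2.12).
def bir_rank(docs, query_terms, retrieved=None):
    """docs: dict doc_id -> set of terms. Returns doc_ids ranked by Eq. (2.8)."""
    N = len(docs)
    n = {k: sum(k in d for d in docs.values()) for k in query_terms}
    if retrieved is None:                    # first pass: Eqs. (2.9), (2.10)
        p_r = {k: 0.5 for k in query_terms}
        p_nr = {k: n[k] / N for k in query_terms}
    else:                                    # feedback pass: Eqs. (2.11), (2.12)
        V = len(retrieved)
        V_i = {k: sum(k in docs[d] for d in retrieved) for k in query_terms}
        p_r = {k: (V_i[k] + 0.5) / (V + 1) for k in query_terms}        # smoothed
        p_nr = {k: (n[k] - V_i[k] + 0.5) / (N - V + 1) for k in query_terms}

    def score(terms):
        s = 1.0
        for k in query_terms:
            # present terms use P(k|R)/P(k|~R); absent terms the complements
            pr, pn = (p_r[k], p_nr[k]) if k in terms else (1 - p_r[k], 1 - p_nr[k])
            s *= pr / pn
        return s

    return sorted(docs, key=lambda d: score(docs[d]), reverse=True)
```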

2.2.4 Comparison of the Classic Models

From our discussion in the previous sections it is clear that the Boolean model is the weakest of the classic models, its main disadvantage being the inability to recognize partial matches, which leads to poor performance. Experiments performed by Croft [15] suggest that the probabilistic model provides better performance than the vector model. However, later experiments by Salton and Buckley [20] showed through several different measures that the vector model outperforms the probabilistic model on general collections. Thus the vector model is the most popular model among researchers, practitioners, and the web community.

2.3 Alternative Set Theoretic Models

Two alternate set theoretic models are popular: the fuzzy set model and the extended Boolean model. In this section we discuss these two models in brief.

2.3.1 Extended Boolean Model

The extended Boolean model first appeared in a 1983 Communications of the ACM article by Salton, Fox and Wu [22]. In the Boolean model, for a query of the form q = k_x ∧ k_y, only a document containing both the index terms k_x and k_y is retrieved; there is no difference between a document which contains either the term k_x or the term k_y and one that contains neither of them. The extended Boolean model, however, allows us to handle partially matching documents, just as the vector space model does. It combines the vector space model and Boolean algebra to calculate the similarities between queries and documents.

Consider the scenario where only two terms (k_x and k_y) are present in the query. We can now map the documents and queries into a two-dimensional space as shown in Figure 2.1. Weights w_1 and w_2 are computed for the terms k_x and k_y respectively in the document d_j. The weights can be computed using the tf-idf factors of the vector space model as follows.

Figure 2.1: Extended Boolean Model. The map at the left shows the similarities of q_or = (k_x ∨ k_y) with documents d_j and d_{j+1}; the map at the right shows the similarities of q_and = (k_x ∧ k_y) with documents d_j and d_{j+1}. [6]

$$w_1 = tf_{x,j} \cdot \frac{idf_x}{\max_i idf_i} \qquad (2.13)$$

where tf_{x,j} is the (normalized) term frequency for the term k_x in the document d_j, idf_x is the inverse document frequency for the term k_x in the entire collection, and idf_i is the inverse document frequency for a generic term k_i. The weight w_2 for the term k_y is calculated similarly. From the map shown at the left in Figure 2.1, we see that for the query q_or = (k_x ∨ k_y) the point (0,0) is the spot to be avoided, and from the map at the right, for the query q_and = (k_x ∧ k_y) the spot (1,1) is the most desirable. This suggests that for the query q_or we take the distance from the spot (0,0) as a measure of similarity, and for the query q_and the complement of the distance from the spot (1,1). Thus we arrive at the formulas:

$$\mathrm{sim}(q_{or}, d_j) = \sqrt{\frac{w_1^2 + w_2^2}{2}} \qquad (2.14)$$

$$\mathrm{sim}(q_{and}, d_j) = 1 - \sqrt{\frac{(1 - w_1)^2 + (1 - w_2)^2}{2}} \qquad (2.15)$$

The 2D extended Boolean model discussed above can easily be generalized, using Euclidean distances, to a document collection in a higher, t-dimensional space. The p-norm model goes further, using not only Euclidean distances but general p-distances, where 1 ≤ p ≤ ∞. A generalized disjunctive query is given by:

$$q_{or} = k_1 \vee^p k_2 \vee^p \cdots \vee^p k_m \qquad (2.16)$$

The similarity between q_or and d_j is given by:

$$\mathrm{sim}(q_{or}, d_j) = \left( \frac{w_{1,j}^p + w_{2,j}^p + \cdots + w_{t,j}^p}{t} \right)^{1/p} \qquad (2.17)$$

A generalized conjunctive query is given by:

$$q_{and} = k_1 \wedge^p k_2 \wedge^p \cdots \wedge^p k_m \qquad (2.18)$$

The similarity between q_and and d_j is given by:

$$\mathrm{sim}(q_{and}, d_j) = 1 - \left( \frac{(1 - w_{1,j})^p + (1 - w_{2,j})^p + \cdots + (1 - w_{t,j})^p}{t} \right)^{1/p} \qquad (2.19)$$

More general queries such as q = (k_1 ∧^p k_2) ∨^p k_3 can easily be processed by grouping the operators in a predefined order. The parameter p can be varied between 1 and infinity to vary the p-norm ranking behavior from vector-based ranking to Boolean-like (fuzzy logic) ranking. The extended Boolean model is thus quite a powerful model for information retrieval. Though it has not been used extensively, it may yet prove itself useful. A small sketch of the p-norm similarities follows.
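This sketch (ours; weights are assumed to lie in [0, 1] as in Figure 2.1) implements Equations (2.17) and (2.19) and illustrates the limiting behavior: at p = 2 it reproduces the Euclidean formulas (2.14) and (2.15), and as p grows, sim(q_or) approaches max(w) and sim(q_and) approaches min(w), i.e. Boolean-like ranking.

```python
# p-norm similarities of Equations (2.17) and (2.19).
def sim_or(weights, p):
    t = len(weights)
    return (sum(w ** p for w in weights) / t) ** (1 / p)

def sim_and(weights, p):
    t = len(weights)
    return 1 - (sum((1 - w) ** p for w in weights) / t) ** (1 / p)

w = [0.9, 0.1]
print(sim_or(w, 2), sim_and(w, 2))       # Euclidean case of Figure 2.1
print(sim_or(w, 50), sim_and(w, 50))     # near max(w) and min(w): Boolean-like
```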

2.3.2 Fuzzy Set Model

The fuzzy retrieval model is based on the extended Boolean model and fuzzy set theory. Let us first review the basic concepts of fuzzy set theory. Fuzzy sets are those whose elements have varying degrees of membership. In classical set theory, membership of an element is assessed in binary terms: an element either belongs or does not belong to the set. This is called a crisply defined set, with every element holding the value of either 0 or 1. Fuzzy set theory allows gradual (rather than abrupt) assessment of the membership of elements in a set. A fuzzy set is described with the aid of a membership function valued in the real interval [0, 1]. A fuzzy set is thus a pair {A, m} where A is a set and m : A → [0, 1]. For each x ∈ A, m(x) is called the grade of membership of x in {A, m}. Let x ∈ A. Then x is not included in the fuzzy set {A, m} if m(x) = 0, x is said to be fully included if m(x) = 1, and x is called a fuzzy member if 0 < m(x) < 1. The set {x ∈ A | m(x) > 0} is called the support of {A, m}, and the set {x ∈ A | m(x) = 1} is called its kernel.

There are two classical fuzzy retrieval models: the Mixed Min and Max (MMM) model and the Paice model.

Mixed Min and Max Model (MMM)

In the MMM model, each index term has a fuzzy set associated with it. A document's weight with respect to an index term A is the degree of membership of the document in the fuzzy set associated with A. Documents to be retrieved for a query of the form {A or B} should be in the fuzzy set associated with the union of the two sets A and B; similarly, documents to be retrieved for a query of the form {A and B} should be in the fuzzy set associated with the intersection of these two sets. Thus the similarity of a document to the or query is max(w_A, w_B) and the similarity to the and query is min(w_A, w_B). The MMM model tries to soften the Boolean operators by considering the query-document similarity to be a linear combination of the min and max document weights. Thus, given a document d_j and index term weights w_1, w_2, ..., w_t for terms k_1, k_2, ..., k_t, and the queries:

$$q_{or} = (k_1 \text{ or } k_2 \text{ or } \ldots \text{ or } k_t) \qquad (2.20)$$

$$q_{and} = (k_1 \text{ and } k_2 \text{ and } \ldots \text{ and } k_t) \qquad (2.21)$$

the MMM model computes the query-document similarity as follows:

$$\mathrm{sim}(q_{or}, d_j) = C_{or_1} \cdot \max(w_1, w_2, \ldots, w_t) + C_{or_2} \cdot \min(w_1, w_2, \ldots, w_t) \qquad (2.22)$$

$$\mathrm{sim}(q_{and}, d_j) = C_{and_1} \cdot \min(w_1, w_2, \ldots, w_t) + C_{and_2} \cdot \max(w_1, w_2, \ldots, w_t) \qquad (2.23)$$

where C_{or_1} and C_{or_2} are the softness coefficients for the or operator, and C_{and_1} and C_{and_2} are the softness coefficients for the and operator [23]. For an or query we would like to give more importance to the maximum of the term weights, and for an and query more importance to the minimum, so we require C_{or_1} > C_{or_2} and C_{and_1} > C_{and_2}. For simplicity we generally assume C_{or_1} = 1 − C_{or_2} and C_{and_1} = 1 − C_{and_2}. Experiments conducted by Lee and Fox [24] show that the best performance of the MMM model occurs with C_{and_1} in the range [0.5, 0.8] and with C_{or_1} > 0.2. The computational cost of the MMM model is generally low, and its retrieval effectiveness is generally better than that of the standard Boolean model.

Paice Model

The Paice model [25] is an extension of the MMM model. The MMM model considers only the maximum and minimum term weights, while the Paice model incorporates all the term weights when calculating the similarity. Thus,

$$\mathrm{sim}(q, d_j) = \frac{\sum_{i=1}^{t} r^{i-1} w_i}{\sum_{i=1}^{t} r^{i-1}} \qquad (2.24)$$

where r is a constant coefficient and the w_i are the term weights, arranged in ascending order for and queries and in descending order for or queries. When t = 2, the Paice model shows the same behavior as the MMM model. Experiments by Lee and Fox [24] show that setting r = 0.1 for and queries and r = 0.7 for or queries gives good retrieval effectiveness. However, this method is more expensive than the MMM model, because the term weights must be sorted in ascending or descending order depending on whether an and clause or an or clause is being considered. The MMM model only requires determining the min or max of a set of term weights, which can be done in O(t); the Paice model requires at least O(t log t) for the sorting, along with more floating point calculations. Both similarity functions are sketched below.

Fuzzy set models have mainly been discussed in the literature dedicated to fuzzy theory and are not very popular in the information retrieval community. Also, the majority of the experiments carried out have considered only small collections, which makes comparison difficult at this time.
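The following sketch (ours; the default coefficient values follow the Lee and Fox [24] recommendations quoted above) implements Equations (2.22) through (2.24) side by side.

```python
# Illustrative MMM (Eqs. 2.22-2.23) and Paice (Eq. 2.24) similarities.
def mmm(weights, is_and, c=0.7):
    # c plays the role of C_and1 (or C_or1); its partner coefficient is 1 - c.
    lo, hi = min(weights), max(weights)
    return c * lo + (1 - c) * hi if is_and else c * hi + (1 - c) * lo

def paice(weights, is_and, r=None):
    r = (0.1 if is_and else 0.7) if r is None else r
    w = sorted(weights, reverse=not is_and)   # ascending for "and" queries
    num = sum(r ** i * wi for i, wi in enumerate(w))
    den = sum(r ** i for i in range(len(w)))
    return num / den

w = [0.2, 0.9, 0.5]
print(mmm(w, is_and=True), paice(w, is_and=True))
```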

2.4 Alternate Algebraic Models

Three alternate algebraic models are discussed in this section: the generalized vector space model, the latent semantic indexing model, and the neural network model.

2.4.1 Generalized Vector Space Model

In the conventional vector space model (VSM) proposed by Salton [16, 19], the index terms are basis vectors of a vector space, and each query is represented as a linear combination of these vectors [26]. The IR retrieval process uses both the query vectors and the document vectors to compute a cosine similarity that ranks the documents according to their degree of similarity with the query. The term frequencies of the terms in a document are used as the components of the document vector. This model assumes that the term vectors are orthogonal, i.e., for each pair of index terms k_i and k_j we have k_i · k_j = 0. However, the terms in a document collection are generally correlated, and an efficient IR model takes these term correlations into consideration. This observation led to the development of the generalized vector space model (GVSM), in which the term vectors may be correlated and hence non-orthogonal.

In GVSM, queries are presented as lists of terms with their corresponding weights. Thus GVSM cannot directly handle Boolean queries (of the form AND, OR, or NOT). However, Wong, Ziarko, Raghavan and Wong [26] show that GVSM can be extended to handle situations where Boolean expressions are used as queries. Let (k_1, k_2, ..., k_t) be the set of index terms in a document collection, and let B_{2^t} be the set of all possible Boolean expressions (also the number of possible patterns of term co-occurrence) using these index terms and the operators AND, OR, and NOT. To represent every possible Boolean expression in B_{2^t} as a vector in a vector space, we need a set of basis vectors corresponding to a set of fundamental expressions which can be combined to generate any element of the Boolean algebra. This leads to the notion of an atomic expression, or minterm. A minterm in t literals (k_1, k_2, ..., k_t) is a conjunction of literals in which each term k_i appears exactly once, in either its complemented or uncomplemented form. Thus in all there are 2^t minterms. The conjunction of any two distinct minterms is always false (zero), and any Boolean expression involving (k_1, k_2, ..., k_t) can be expressed as a disjunction of minterms. Let us define the following set of m_i vectors:

$$\vec{m}_1 = (1, 0, \ldots, 0, 0) \qquad (2.25)$$
$$\vec{m}_2 = (0, 1, \ldots, 0, 0) \qquad (2.26)$$
$$\vdots \qquad (2.27)$$
$$\vec{m}_{2^t} = (0, 0, \ldots, 0, 1) \qquad (2.28)$$

where each vector m_i is associated with the respective minterm m_i. Given these basis vectors, the vector representation of any Boolean expression is the vector sum of the basis vectors of its minterms. Notice that for all i ≠ j, m_i · m_j = 0; thus the set of m_i vectors is pairwise orthonormal. If two vectors are not orthogonal, then their corresponding Boolean expressions must have at least one minterm in common.

To determine an expression for the index term vector k_i associated with the index term k_i, note that each term k_i is an element of the Boolean algebra generated, and can be expressed in disjunctive normal form (a disjunction of minterms) as:

$$k_i = m_{i_1} \text{ OR } m_{i_2} \ldots \text{ OR } m_{i_p} \qquad (2.29)$$

where the m_{i_j} are the minterms in which k_i appears uncomplemented and 1 ≤ j ≤ 2^t. If we denote the set of these minterms by m_i, we can define the term vector k_i as

$$\vec{k}_i = \sum_{\vec{m}_r \in m_i} \vec{m}_r \qquad (2.30)$$

or,

$$\vec{k}_i = \sum_{r=1}^{2^t} c_{i,r}\, \vec{m}_r \qquad (2.31)$$

A small sketch of the minterm construction follows.
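To make the minterm machinery concrete, the sketch below (ours; names are hypothetical, and it stops at the construction described in the text, before the coefficients c_{i,r} are assigned) enumerates the 2^t minterms and identifies the one matching a document's term-occurrence pattern.

```python
# Illustrative minterm enumeration for GVSM.
# A minterm over t terms is a truth assignment; a document's term
# co-occurrence pattern selects exactly one minterm.
from itertools import product

t = 3
minterms = list(product([0, 1], repeat=t))      # all 2^t atomic expressions

def minterm_of(doc_terms, vocabulary):
    # The single minterm matching a document's term-occurrence pattern.
    return tuple(int(k in doc_terms) for k in vocabulary)

vocab = ["k1", "k2", "k3"]
print(len(minterms), minterm_of({"k1", "k3"}, vocab))   # 8 (1, 0, 1)
# The basis vector for minterm m_r is the r-th standard unit vector in a
# 2^t-dimensional space; per Equation (2.30), the term vector k_i sums the
# basis vectors of all minterms with a 1 in position i.
```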


Keyword Extraction by KNN considering Similarity among Features 64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,

More information

International ejournals

International ejournals Available online at www.internationalejournals.com International ejournals ISSN 0976 1411 International ejournal of Mathematics and Engineering 112 (2011) 1023-1029 ANALYZING THE REQUIREMENTS FOR TEXT

More information

Design and Implementation of Search Engine Using Vector Space Model for Personalized Search

Design and Implementation of Search Engine Using Vector Space Model for Personalized Search Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 1, January 2014,

More information

Multimedia Information Systems

Multimedia Information Systems Multimedia Information Systems Samson Cheung EE 639, Fall 2004 Lecture 6: Text Information Retrieval 1 Digital Video Library Meta-Data Meta-Data Similarity Similarity Search Search Analog Video Archive

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH

A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH A thesis Submitted to the faculty of the graduate school of the University of Minnesota by Vamshi Krishna Thotempudi In partial fulfillment of the requirements

More information

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Murhaf Fares & Stephan Oepen Language Technology Group (LTG) September 27, 2017 Today 2 Recap Evaluation of classifiers Unsupervised

More information

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

F. Aiolli - Sistemi Informativi 2006/2007

F. Aiolli - Sistemi Informativi 2006/2007 Text Categorization Text categorization (TC - aka text classification) is the task of buiding text classifiers, i.e. sofware systems that classify documents from a domain D into a given, fixed set C =

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

Encoding Words into String Vectors for Word Categorization

Encoding Words into String Vectors for Word Categorization Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

More information

Knowledge Discovery and Data Mining 1 (VO) ( )

Knowledge Discovery and Data Mining 1 (VO) ( ) Knowledge Discovery and Data Mining 1 (VO) (707.003) Data Matrices and Vector Space Model Denis Helic KTI, TU Graz Nov 6, 2014 Denis Helic (KTI, TU Graz) KDDM1 Nov 6, 2014 1 / 55 Big picture: KDDM Probability

More information

Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm

Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm K.Parimala, Assistant Professor, MCA Department, NMS.S.Vellaichamy Nadar College, Madurai, Dr.V.Palanisamy,

More information

Predictive Indexing for Fast Search

Predictive Indexing for Fast Search Predictive Indexing for Fast Search Sharad Goel, John Langford and Alex Strehl Yahoo! Research, New York Modern Massive Data Sets (MMDS) June 25, 2008 Goel, Langford & Strehl (Yahoo! Research) Predictive

More information

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! (301) 219-4649 james.mayfield@jhuapl.edu What is Information Retrieval? Evaluation

More information

Combining PGMs and Discriminative Models for Upper Body Pose Detection

Combining PGMs and Discriminative Models for Upper Body Pose Detection Combining PGMs and Discriminative Models for Upper Body Pose Detection Gedas Bertasius May 30, 2014 1 Introduction In this project, I utilized probabilistic graphical models together with discriminative

More information

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl

More information

Retrieval Evaluation. Hongning Wang

Retrieval Evaluation. Hongning Wang Retrieval Evaluation Hongning Wang CS@UVa What we have learned so far Indexed corpus Crawler Ranking procedure Research attention Doc Analyzer Doc Rep (Index) Query Rep Feedback (Query) Evaluation User

More information

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT Prof. Dipartimento di Elettronica e Informazione Politecnico di Milano INFORMATION SEARCH AND RETRIEVAL Inf. retrieval 1 PRESENTATION SCHEMA GOALS AND

More information

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University of Economics

More information

Authoritative K-Means for Clustering of Web Search Results

Authoritative K-Means for Clustering of Web Search Results Authoritative K-Means for Clustering of Web Search Results Gaojie He Master in Information Systems Submission date: June 2010 Supervisor: Kjetil Nørvåg, IDI Co-supervisor: Robert Neumayer, IDI Norwegian

More information

A Graph Theoretic Approach to Image Database Retrieval

A Graph Theoretic Approach to Image Database Retrieval A Graph Theoretic Approach to Image Database Retrieval Selim Aksoy and Robert M. Haralick Intelligent Systems Laboratory Department of Electrical Engineering University of Washington, Seattle, WA 98195-2500

More information

Tag-based Social Interest Discovery

Tag-based Social Interest Discovery Tag-based Social Interest Discovery Xin Li / Lei Guo / Yihong (Eric) Zhao Yahoo!Inc 2008 Presented by: Tuan Anh Le (aletuan@vub.ac.be) 1 Outline Introduction Data set collection & Pre-processing Architecture

More information

Instructor: Stefan Savev

Instructor: Stefan Savev LECTURE 2 What is indexing? Indexing is the process of extracting features (such as word counts) from the documents (in other words: preprocessing the documents). The process ends with putting the information

More information

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques 24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE

More information

EFFICIENT ATTACKS ON HOMOPHONIC SUBSTITUTION CIPHERS

EFFICIENT ATTACKS ON HOMOPHONIC SUBSTITUTION CIPHERS EFFICIENT ATTACKS ON HOMOPHONIC SUBSTITUTION CIPHERS A Project Report Presented to The faculty of the Department of Computer Science San Jose State University In Partial Fulfillment of the Requirements

More information

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY BHARAT SIGINAM IN

More information

VK Multimedia Information Systems

VK Multimedia Information Systems VK Multimedia Information Systems Mathias Lux, mlux@itec.uni-klu.ac.at This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Information Retrieval Basics: Agenda Vector

More information

Department of Electronic Engineering FINAL YEAR PROJECT REPORT

Department of Electronic Engineering FINAL YEAR PROJECT REPORT Department of Electronic Engineering FINAL YEAR PROJECT REPORT BEngCE-2007/08-HCS-HCS-03-BECE Natural Language Understanding for Query in Web Search 1 Student Name: Sit Wing Sum Student ID: Supervisor:

More information

Instantaneously trained neural networks with complex inputs

Instantaneously trained neural networks with complex inputs Louisiana State University LSU Digital Commons LSU Master's Theses Graduate School 2003 Instantaneously trained neural networks with complex inputs Pritam Rajagopal Louisiana State University and Agricultural

More information

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information

More information

Enhancing Clustering Results In Hierarchical Approach By Mvs Measures

Enhancing Clustering Results In Hierarchical Approach By Mvs Measures International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.25-30 Enhancing Clustering Results In Hierarchical Approach

More information

Duke University. Information Searching Models. Xianjue Huang. Math of the Universe. Hubert Bray

Duke University. Information Searching Models. Xianjue Huang. Math of the Universe. Hubert Bray Duke University Information Searching Models Xianjue Huang Math of the Universe Hubert Bray 24 July 2017 Introduction Information searching happens in our daily life, and even before the computers were

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Ranking Algorithms For Digital Forensic String Search Hits

Ranking Algorithms For Digital Forensic String Search Hits DIGITAL FORENSIC RESEARCH CONFERENCE Ranking Algorithms For Digital Forensic String Search Hits By Nicole Beebe and Lishu Liu Presented At The Digital Forensic Research Conference DFRWS 2014 USA Denver,

More information

60-538: Information Retrieval

60-538: Information Retrieval 60-538: Information Retrieval September 7, 2017 1 / 48 Outline 1 what is IR 2 3 2 / 48 Outline 1 what is IR 2 3 3 / 48 IR not long time ago 4 / 48 5 / 48 now IR is mostly about search engines there are

More information

21. Search Models and UIs for IR

21. Search Models and UIs for IR 21. Search Models and UIs for IR INFO 202-10 November 2008 Bob Glushko Plan for Today's Lecture The "Classical" Model of Search and the "Classical" UI for IR Web-based Search Best practices for UIs in

More information

Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

Vannevar Bush. Information Retrieval. Prophetic: Hypertext. Historic Vision 2/8/17

Vannevar Bush. Information Retrieval. Prophetic: Hypertext. Historic Vision 2/8/17 Information Retrieval Vannevar Bush Director of the Office of Scientific Research and Development (1941-1947) Vannevar Bush,1890-1974 End of WW2 - what next big challenge for scientists? 1 Historic Vision

More information

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. Knowledge Retrieval Franz J. Kurfess Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. 1 Acknowledgements This lecture series has been sponsored by the European

More information

Contents 1. INTRODUCTION... 3

Contents 1. INTRODUCTION... 3 Contents 1. INTRODUCTION... 3 2. WHAT IS INFORMATION RETRIEVAL?... 4 2.1 FIRST: A DEFINITION... 4 2.1 HISTORY... 4 2.3 THE RISE OF COMPUTER TECHNOLOGY... 4 2.4 DATA RETRIEVAL VERSUS INFORMATION RETRIEVAL...

More information

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University CS490W Text Clustering Luo Si Department of Computer Science Purdue University [Borrows slides from Chris Manning, Ray Mooney and Soumen Chakrabarti] Clustering Document clustering Motivations Document

More information

ResPubliQA 2010

ResPubliQA 2010 SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first

More information

String Vector based KNN for Text Categorization

String Vector based KNN for Text Categorization 458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research

More information

XI International PhD Workshop OWD 2009, October Fuzzy Sets as Metasets

XI International PhD Workshop OWD 2009, October Fuzzy Sets as Metasets XI International PhD Workshop OWD 2009, 17 20 October 2009 Fuzzy Sets as Metasets Bartłomiej Starosta, Polsko-Japońska WyŜsza Szkoła Technik Komputerowych (24.01.2008, prof. Witold Kosiński, Polsko-Japońska

More information

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,

More information

Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

More information

A Framework for adaptive focused web crawling and information retrieval using genetic algorithms

A Framework for adaptive focused web crawling and information retrieval using genetic algorithms A Framework for adaptive focused web crawling and information retrieval using genetic algorithms Kevin Sebastian Dept of Computer Science, BITS Pilani kevseb1993@gmail.com 1 Abstract The web is undeniably

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

Data Analytics and Boolean Algebras

Data Analytics and Boolean Algebras Data Analytics and Boolean Algebras Hans van Thiel November 28, 2012 c Muitovar 2012 KvK Amsterdam 34350608 Passeerdersstraat 76 1016 XZ Amsterdam The Netherlands T: + 31 20 6247137 E: hthiel@muitovar.com

More information

Semantic text features from small world graphs

Semantic text features from small world graphs Semantic text features from small world graphs Jurij Leskovec 1 and John Shawe-Taylor 2 1 Carnegie Mellon University, USA. Jozef Stefan Institute, Slovenia. jure@cs.cmu.edu 2 University of Southampton,UK

More information

Natural Language Processing

Natural Language Processing Natural Language Processing Information Retrieval Potsdam, 14 June 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book Outline 2 1 Introduction 2 Indexing Block Document

More information

Modelling Structures in Data Mining Techniques

Modelling Structures in Data Mining Techniques Modelling Structures in Data Mining Techniques Ananth Y N 1, Narahari.N.S 2 Associate Professor, Dept of Computer Science, School of Graduate Studies- JainUniversity- J.C.Road, Bangalore, INDIA 1 Professor

More information

Information Retrieval and Data Mining Part 1 Information Retrieval

Information Retrieval and Data Mining Part 1 Information Retrieval Information Retrieval and Data Mining Part 1 Information Retrieval 2005/6, Karl Aberer, EPFL-IC, Laboratoire de systèmes d'informations répartis Information Retrieval - 1 1 Today's Question 1. Information

More information

Semantic Search in s

Semantic Search in  s Semantic Search in Emails Navneet Kapur, Mustafa Safdari, Rahul Sharma December 10, 2010 Abstract Web search technology is abound with techniques to tap into the semantics of information. For email search,

More information

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues COS 597A: Principles of Database and Information Systems Information Retrieval Traditional database system Large integrated collection of data Uniform access/modifcation mechanisms Model of data organization

More information

A Content Vector Model for Text Classification

A Content Vector Model for Text Classification A Content Vector Model for Text Classification Eric Jiang Abstract As a popular rank-reduced vector space approach, Latent Semantic Indexing (LSI) has been used in information retrieval and other applications.

More information

Chapter 3 - Text. Management and Retrieval

Chapter 3 - Text. Management and Retrieval Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter 3 - Text Management and Retrieval Literature: Baeza-Yates, R.;

More information

Ranking models in Information Retrieval: A Survey

Ranking models in Information Retrieval: A Survey Ranking models in Information Retrieval: A Survey R.Suganya Devi Research Scholar Department of Computer Science and Engineering College of Engineering, Guindy, Chennai, Tamilnadu, India Dr D Manjula Professor

More information

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Advanced Research in Computer Science and Software Engineering Volume 3, Issue 3, March 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue:

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

More information

Making Retrieval Faster Through Document Clustering

Making Retrieval Faster Through Document Clustering R E S E A R C H R E P O R T I D I A P Making Retrieval Faster Through Document Clustering David Grangier 1 Alessandro Vinciarelli 2 IDIAP RR 04-02 January 23, 2004 D a l l e M o l l e I n s t i t u t e

More information

INFORMATION RETRIEVAL SYSTEM: CONCEPT AND SCOPE

INFORMATION RETRIEVAL SYSTEM: CONCEPT AND SCOPE 15 : CONCEPT AND SCOPE 15.1 INTRODUCTION Information is communicated or received knowledge concerning a particular fact or circumstance. Retrieval refers to searching through stored information to find

More information

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction CHAPTER 5 SUMMARY AND CONCLUSION Chapter 1: Introduction Data mining is used to extract the hidden, potential, useful and valuable information from very large amount of data. Data mining tools can handle

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Structural and Syntactic Pattern Recognition

Structural and Syntactic Pattern Recognition Structural and Syntactic Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2017 CS 551, Fall 2017 c 2017, Selim Aksoy (Bilkent

More information

Information Retrieval

Information Retrieval s Information Retrieval Information system management system Model Processing of queries/updates Queries Answer Access to stored data Patrick Lambrix Department of Computer and Information Science Linköpings

More information