Dependence among Terms in Vector Space Model

Ilmério Reis Silva, João Nunes de Souza, Karina Silveira Santos
Faculdade de Computação - Universidade Federal de Uberlândia (UFU)
e-mail: [ilmerio, Nunes]@facom.ufu.br, karinass@pop.com.br

Abstract

The vector space model is a mathematically based model that represents terms, documents and queries as vectors and provides a ranking. In this model, the subspace of interest is formed by a set of pairwise orthogonal term vectors, which means that the terms are treated as mutually independent. However, this is an oversimplification. With this in view, we present an extension to the vector space model that takes the correlation among terms into account. In the proposed model, term vectors, originally orthogonal, are rotated in space so that they geometrically reflect the dependence semantics among terms. This rotation can be driven by any technique that generates information on the relationship among the terms of the collection. We propose using association rules to find sets of terms that co-occur in the document collection. The retrieval effectiveness of the proposed model is evaluated, and the results show that it improves average precision relative to the standard vector model for all collections evaluated, with gains of up to 31%.

1. Introduction

In information retrieval (IR), the vector space model is the most popular model [16,17,18]. Its definition of term weights in documents and its partial matching of queries against documents result in a good ranking strategy. Besides, it is simple and fast [4,7].

Although it is one of the most widely used models of information retrieval, the vector space model has disadvantages [6,21]. Documents are represented by keywords extracted from them, and the relationship among those keywords is not considered. This means, for instance, that the context in which the terms appear is not represented. This is an oversimplification of the model.
Many words have multiple meanings, and the terms of a query can literally match the terms of an irrelevant document.

Considering this, the main objective of this work is to incorporate information about the correlation among the terms of the collection into the vector space model in order to improve its retrieval effectiveness.

The proposed solution alters the representation of term vectors in the vector space model. In this model, terms are represented by orthogonal vectors, since no correlation among the terms is known a priori. The algorithm proposed in this work has as its main foundation the rotation of those vectors in the space, so that their representations reflect the dependence among the terms. All term vectors that have some correlation with one or more terms are rotated in the space. After all rotations, term vectors are not necessarily orthogonal to each other. In the resulting set of vectors, the proximity between two vectors is related to the degree of dependence between the respective terms: the closer the term vectors, the greater the dependence observed between them. The rotation of term vectors is based on techniques that yield information on the relationship among the terms of the collection. We use the data mining technique of association rules to obtain this information.

The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 describes the foundations of the vector space model. Section 4 presents association rules in the context of information retrieval. The proposed model is described in Section 5. The experimental results are discussed in Section 6. Finally, Section 7 presents conclusions and future work.

2. Related Work

Several approaches for incorporating correlation among terms have already been presented in the literature. We describe the works related to this paper.

Query expansion in the vector space model is suggested in several proposals, among them [7,11,12,20]. In [20], Voorhees examined the usefulness of lexical query expansion in the TREC collection and obtained considerable improvements in effectiveness only for short queries. Mandala et al. [11] analyzed the characteristics of different thesaurus types and proposed a method to combine them and expand queries. In [12], Nie and Jin used the logical operator OR to connect expansion terms with the original terms of the query.

In [5], Becker and Kuropka present an IR model for the comparison of documents that represents topics, terms and documents as vectors. The basis of the space is formed by a set of orthogonal topic vectors, in which term vectors are represented. The angle between the term vectors and the weight of each term are calculated using information about the collection, such as a list of the stems of the collection terms.

Work similar to that proposed here was carried out by Pôssas et al. in [13,14,15]. They suggested an extension to the vector space model that considers the correlation among terms, obtained using association rules. In [15], a new model named the set-based model is presented for computing term weights, based on set theory, and for ranking documents.
For computing those weights, the theory of association rules is used. The proposal presented by the authors in [14] is similar to [15]; the main difference lies in how association rules are used. Then, in [13], an extension to the set-based model is proposed that uses information about the proximity among the query terms in the documents.

The generalized vector space model (GVSM) is another extension of the vector space model that contemplates the correlation among terms [22,23]. In GVSM, the terms can be non-orthogonal and are represented by smaller components named minterms. The minterms are vectors with binary weights that indicate all possible patterns of co-occurrence of terms in documents. The basis for GVSM is formed by a set of $2^t$ minterm vectors, where t is the number of distinct terms in the collection. The term vectors are linear combinations of minterms, reflecting the co-occurrence captured by the minterms.

Our work differs from the related works above in the following aspects. In none of the cited works are the term vectors rotated in the space to reflect their correlation, as is done in this work. Moreover, association rules are used here to determine the proximity among term vectors, which differs from the cited models.

3. Vector Space Model

The vector space model was initially proposed by Gerard Salton [16,17]. In this model, all objects relevant to an information retrieval system are represented as vectors: terms, documents and queries.

Each term $k_i$ is represented as a t-dimensional vector, where t is the number of distinct terms in the collection. In the vector space model, the vector $\vec{k}_i$ represents the term $k_i$. If $a_r$ is the r-th element of the vector $\vec{k}_i$, then $\vec{k}_i = (a_1, a_2, \ldots, a_t)$, where

$$a_r = \begin{cases} 1 & \text{if } r = i \\ 0 & \text{if } r \neq i \end{cases}$$

that is,

$$\vec{k}_1 = (1, 0, 0, \ldots, 0), \quad \vec{k}_2 = (0, 1, 0, \ldots, 0), \quad \ldots, \quad \vec{k}_t = (0, 0, 0, \ldots, 1)$$

The set of all term vectors $K = \{\vec{k}_1, \vec{k}_2, \ldots, \vec{k}_t\}$ is linearly independent and forms the canonical basis for the space $R^t$. The term vectors are pairwise orthogonal and, in consequence, the corresponding terms are considered independent.

Document and query vectors are represented using the set K of term vectors. These vectors are built as linear combinations of the term vectors. The vector $\vec{d}_j$ associated with the document $d_j$ is defined as:

$$\vec{d}_j = \sum_{i=1}^{t} w_{i,j}\,\vec{k}_i \quad \text{or} \quad \vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$$

Similarly, the vector for the query q is defined as:

$$\vec{q} = \sum_{i=1}^{t} w_{i,q}\,\vec{k}_i \quad \text{or} \quad \vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$$

In the equalities above, $w_{i,j}$ and $w_{i,q}$ are the weights of term i in document j and in query q, respectively. The most effective definition of term weights for information retrieval is known as tf-idf [4]. This strategy considers the number of times an index term occurs in a document and the number of documents of the collection in which that index term occurs.

The vector space model evaluates the degree of similarity of the document $d_j$ in relation to the query q as the correlation between the vectors $\vec{d}_j$ and $\vec{q}$. The estimated relevance of a document to a query increases as the angle between the respective vectors decreases. Usually, that correlation is quantified by the cosine of the angle between the two vectors. That is,

$$sim(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2}\,\sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$
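As an illustration of the ranking just described, the following minimal Python sketch computes tf-idf weights and the cosine similarity for a toy collection. It is not the implementation used in this paper: the idf variant log(N/df), the function names and the sample documents are all assumptions made for the example.

```python
import math
from collections import Counter


def idf_table(docs):
    """Map each term to log(N / df); this idf variant is an assumption."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def weight_vector(tokens, vocab, idf):
    """tf-idf weights of a token list over the ordered vocabulary."""
    tf = Counter(tokens)
    return [tf[term] * idf[term] for term in vocab]


def cosine(u, v):
    """Cosine of the angle between two weight vectors (0.0 if either is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(docs, query):
    """Return (document index, similarity) pairs, most similar first."""
    idf = idf_table(docs)
    vocab = sorted(idf)
    q_vec = weight_vector(query, vocab, idf)
    scores = [(j, cosine(weight_vector(doc, vocab, idf), q_vec))
              for j, doc in enumerate(docs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical toy collection of tokenized documents.
    docs = [["tourism", "hotel", "beach"],
            ["hotel", "price"],
            ["database", "index"]]
    print(rank(docs, ["tourism"]))
```

With orthogonal term vectors, the second document, which mentions only "hotel", receives a similarity of zero for the query "tourism". This is the limitation that the term dependence discussed in this paper is meant to address.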

The documents closest to the query in the space are considered relevant to the user and are returned as the answer set for the query. Once the similarity degrees have been computed, the documents can be ordered into a ranked list according to their degrees of relevance to the query.

4. Association Rules in Information Retrieval

In the area of data mining, association rules typically serve to represent frequent patterns found in the data [1,2,3,9]. The main function of the rules is to characterize the data by representing regularities. One of the purposes of this work is to use data mining in IR.

In general, the data mining literature works with items and transactions. However, the algorithms used for the discovery of association rules can also be adapted to work with terms and documents, identifying the co-occurrence among terms. In the IR context, an association rule $X \rightarrow Y$ relates X and Y, which are terms or sets of terms.

Consider the following example, which defines an association rule in IR. The information that documents whose theme is tourism also discuss hotels is represented by the association rule (1) below:

tourism $\rightarrow$ hotel [support = 2%, confidence = 80%]   (1)

The support and the confidence of a rule are two measures that reflect, respectively, the usefulness and the certainty of the rules found. The support is a percentage relative to the entire collection of documents analyzed. In the example above, the words tourism and hotel appear together in the same document in 2% of the collection. The confidence is a percentage relative to the antecedent of the rule. A confidence of 80% reveals that 80% of the documents that discuss tourism also discuss hotels. Typically, association rules are considered useful if they meet both a support threshold and a confidence threshold [12].

4.1. Basic Concepts

Let $J = \{k_1, k_2, \ldots, k_m\}$ be the set of distinct terms in a collection of documents D. Each document $d_j$ of the database is a set of terms such that $d_j \subseteq J$.
An association rule is an implication of the form $A \rightarrow B$, where $A \subset J$, $B \subset J$, and $A \cap B = \emptyset$. The rule $A \rightarrow B$ holds in a set of documents D with support s if s is the percentage of documents in D that contain $A \cup B$ (in other words, A and B at the same time). The rule $A \rightarrow B$ has confidence c in the set of documents D if c is the percentage of documents in D containing A that also contain B. Rules that meet a minimum support (min_sup) and a minimum confidence (min_conf) are termed strong.

A set of terms is referred to as a termset. A termset that contains k terms is a k-termset. Association rules are found in large databases in two steps:

1. Find all the sets of terms (termsets) that meet the minimum support. These termsets are named frequent termsets.
2. Generate strong association rules from the frequent termsets. By definition, these rules must meet the minimum support and the minimum confidence.

Apriori is an algorithm for mining frequent termsets for association rules [2,3,12]. Apriori uses an iterative, level-wise search, where k-termsets are used to explore (k+1)-termsets. First, the set of frequent 1-termsets is found. This set is denoted $L_1$. $L_1$ is used to find $L_2$, the set of frequent 2-termsets, which is used to find $L_3$, and so on, until no more frequent k-termsets can be found. Finding each $L_k$ requires a complete scan of the database. To improve the efficiency of generating frequent termsets, an important property, called the Apriori property, is used to reduce the search space: every nonempty subset of a frequent termset must also be frequent.

Once the frequent termsets of the transactions in the database D have been generated, the strong association rules can be generated. This can be done using the following equation for the confidence, based on termset frequencies:

$$confidence(A \rightarrow B) = P(B \mid A) = \frac{freq(A \cup B)}{freq(A)}$$

where $freq(A \cup B)$ is the number of transactions containing the termset $A \cup B$, and $freq(A)$ is the number of transactions containing A. Based on this equation, association rules can be generated as follows:

For each frequent termset l, generate all nonempty proper subsets of l.
For each such subset s of l, generate the rule $s \rightarrow (l - s)$ if $\frac{freq(l)}{freq(s)} \geq$ min_conf, where min_conf is the minimum confidence threshold.

In the following section, we show how association rules modify the vector space model.

5. Vector Space Model Modified by Association Rules

The main foundation of the algorithm proposed in this work is the rotation of the term vectors in the space, so that their representations geometrically reflect the adopted semantics of correlation among terms. We use association rules as a tool for generating information about the dependence among the terms. The term vectors are rotated in the space so that they reflect, in a geometric way, the semantics defined by the association rules. This method is based on the assumption that a pair of words that frequently occur together in the same documents is related to the same subject.
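To make the support and confidence measures concrete for rules between single terms, the following Python sketch mines strong rules between pairs of terms. It is a simplified illustration rather than a full Apriori implementation: it stops at 2-termsets, and the function name `pairwise_rules` and the sample documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pairwise_rules(docs, min_sup, min_conf):
    """Return {(antecedent, consequent): confidence} for strong 1-term rules.

    Each document is treated as a transaction, i.e. a set of terms.
    min_sup and min_conf are fractions in [0, 1].
    """
    transactions = [set(doc) for doc in docs]
    n = len(transactions)

    # Level 1: frequency of each single term (1-termsets).
    freq1 = Counter(term for doc in transactions for term in doc)
    frequent1 = {term for term, f in freq1.items() if f / n >= min_sup}

    # Level 2: by the Apriori property, a frequent 2-termset can only
    # be built from frequent 1-termsets, so other terms are skipped.
    freq2 = Counter()
    for doc in transactions:
        for pair in combinations(sorted(doc & frequent1), 2):
            freq2[pair] += 1

    rules = {}
    for (a, b), f in freq2.items():
        if f / n < min_sup:
            continue
        # A 2-termset {a, b} yields two candidate rules: a -> b and b -> a.
        for antecedent, consequent in ((a, b), (b, a)):
            confidence = f / freq1[antecedent]
            if confidence >= min_conf:
                rules[(antecedent, consequent)] = confidence
    return rules


if __name__ == "__main__":
    # Hypothetical toy collection: 10 documents as term sets.
    docs = [{"tourism", "hotel"}, {"tourism", "hotel", "beach"},
            {"tourism", "hotel"}, {"tourism", "hotel"},
            {"tourism", "flight"}, {"hotel", "price"}, {"hotel"},
            {"database"}, {"database", "index"}, {"index"}]
    print(pairwise_rules(docs, min_sup=0.2, min_conf=0.7))
```

In this toy collection, "tourism" and "hotel" co-occur in 4 of 10 documents (support 40%), and 4 of the 5 documents containing "tourism" also contain "hotel" (confidence 80%), so only the rule tourism $\rightarrow$ hotel is strong under these thresholds. The reverse rule has a confidence of about 67% and is rejected.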
The association rules are of the form $k_i \rightarrow k_j$, and $c_{ij}$ is the confidence of the rule, which indicates the degree of dependence of the term $k_i$ on the term $k_j$. In this work, that value is used to compute the new angle between the term vectors $\vec{k}_i$ and $\vec{k}_j$. The confidence was chosen as the parameter that determines the proximity of the term vectors because it reflects the certainty of the association rule. The term vectors are brought closer together according to the association rules created for the respective terms, as follows:

Definition 5.1 (Rotation of basis vectors): Let $\vec{k}_i$ and $\vec{k}_j$ be two term vectors, and let $c_{ij}$ be the confidence of the association rule $k_i \rightarrow k_j$. The new angle $\theta_{ij}$ between $\vec{k}_i$ and $\vec{k}_j$ is given by

$$\theta_{ij} = 90^\circ \times (1 - c_{ij})$$

where $90^\circ$ is the original angle between the vectors $\vec{k}_i$ and $\vec{k}_j$.

In this case, the rotation occurs only in the vector $\vec{k}_i$; the vector $\vec{k}_j$ is not modified. The reason for this is related to the semantics of the association rule and of the confidence. The confidence $c_{ij}$ of the association rule $k_i \rightarrow k_j$ states that, in a fraction $c_{ij}$ of the documents in which the term $k_i$ appears, the term $k_j$ also appears. Therefore, the rotation is applied to the vector corresponding to the antecedent term of the association rule.

$\theta_{ij}$ is the new angle between the term vectors $\vec{k}_i$ and $\vec{k}_j$ whenever $\theta_{ij} < 90^\circ$. The vector $\vec{k}_i$ approaches the vector $\vec{k}_j$, and the new vector is named $\vec{k}'_i$, where the r-th element of the vector $\vec{k}'_i$, named $a'_r$, is defined as:

$$a'_r = \begin{cases} \sin(\theta_{ij}) & \text{if } r = i \\ \cos(\theta_{ij}) & \text{if } r = j \\ 0 & \text{if } r \neq i \text{ and } r \neq j \end{cases}$$

Therefore, the vector $\vec{k}_i$ is transformed into the vector $\vec{k}'_i = (a'_1, a'_2, \ldots, a'_t)$ by altering positions i and j of the original vector. In position i we have $\sin(\theta_{ij})$ and in position j we have $\cos(\theta_{ij})$.

If a term $k_p$ has two or more associated terms, the new vector $\vec{k}'_p$ is normalized as follows. Let $k_p \rightarrow k_n$ and $k_p \rightarrow k_v$ be two association rules with the same antecedent and respective confidences $c_{pn}$ and $c_{pv}$. The new vector $\vec{k}'_p$ is defined as

$$\vec{k}'_p = \frac{\vec{k}'_{pn} + \vec{k}'_{pv}}{|\vec{k}'_{pn} + \vec{k}'_{pv}|}$$

where $\vec{k}'_{pn}$ is the vector $\vec{k}_p$ modified using $k_p \rightarrow k_n$ and $c_{pn}$ (Definition 5.1), and $\vec{k}'_{pv}$ is the vector modified using $k_p \rightarrow k_v$ and $c_{pv}$.

The vector space basis K is formed by the set of term vectors $\{\vec{k}_1, \vec{k}_2, \ldots, \vec{k}_t\}$. After the rotation of the term vectors, the new basis for the vector space, denoted $K'$, is obtained from K by replacing the vectors $\vec{k}_i$ with $\vec{k}'_i$, so $K' = \{\vec{k}'_1, \vec{k}'_2, \ldots, \vec{k}'_t\}$. The set $K'$ still forms a basis of the vector space $R^t$ because its vectors are linearly independent.

The document and query vectors are represented in the new basis $K'$ as linear combinations of the term vectors $\vec{k}'_i$. They are termed $\vec{d}'_j$ and $\vec{q}'$ and defined as:

$$\vec{d}'_j = \sum_{i=1}^{t} w_{i,j}\,\vec{k}'_i \qquad \vec{q}' = \sum_{i=1}^{t} w_{i,q}\,\vec{k}'_i$$

So the document and query vectors $\vec{d}'_j$ and $\vec{q}'$ now reflect the dependence semantics among the terms that is implicit in the basis $K'$.
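The rotation of Definition 5.1 and the normalization for several rules with the same antecedent can be sketched in Python as follows. This is an illustrative reading of the definitions above, not the authors' code: the function name, the zero-based term indices and the dictionary representation of the rules are assumptions.

```python
import math


def rotated_term_vector(i, rules, t):
    """Build the rotated vector k'_i in a t-dimensional space.

    rules maps the index j of each consequent term to the confidence
    c_ij of the rule k_i -> k_j (hypothetical representation).
    """
    if not rules:
        # No association rule: the canonical basis vector is kept.
        vector = [0.0] * t
        vector[i] = 1.0
        return vector

    total = [0.0] * t
    for j, confidence in rules.items():
        # Definition 5.1: new angle theta_ij = 90 * (1 - c_ij) degrees.
        theta = math.radians(90.0 * (1.0 - confidence))
        total[i] += math.sin(theta)  # position i of the rotated vector
        total[j] += math.cos(theta)  # position j of the rotated vector

    if len(rules) > 1:
        # Several rules share the antecedent: normalize the vector sum.
        norm = math.sqrt(sum(x * x for x in total))
        total = [x / norm for x in total]
    return total


if __name__ == "__main__":
    # Term 0 -> term 1 with confidence 0.8: the angle shrinks to 18 degrees.
    k0 = rotated_term_vector(0, {1: 0.8}, t=3)
    angle = math.degrees(math.acos(k0[1]))  # angle to the unit vector k_1
    print(k0, angle)
```

For a confidence of 0.8 the rotated vector is approximately (0.309, 0.951, 0), which lies 18 degrees from $\vec{k}_j$ instead of the original 90 degrees.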
The same similarity function is used in the vector space model modified by dependence among the terms. Therefore, we have

$$sim(d'_j, q') = \frac{\vec{d}'_j \cdot \vec{q}'}{|\vec{d}_j| \times |\vec{q}|} = \frac{\left(\sum_{i=1}^{t} w_{i,j}\,\vec{k}'_i\right) \cdot \left(\sum_{s=1}^{t} w_{s,q}\,\vec{k}'_s\right)}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2}\,\sqrt{\sum_{s=1}^{t} w_{s,q}^2}} = \frac{\sum_{i,s=1}^{t} w_{i,j}\,\vec{k}'_i \cdot w_{s,q}\,\vec{k}'_s}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2}\,\sqrt{\sum_{s=1}^{t} w_{s,q}^2}}$$

The similarity between the query and the documents is modified by the changes in the respective vectors, which are now expressed in a non-orthogonal basis. The normalization of the similarity, that is, the factors in the denominator of the formula, uses the original norm of the documents. That strategy was adopted because otherwise, should the normalization use the document vectors $\vec{d}'_j$, the norm of every document would have to be recalculated, raising the computational cost of calculating the similarity. Besides, that simplification does not change the results significantly.

In the computation of the similarity between the query and the documents, the main consequence of rotating the term vectors is automatic query expansion. The query is expanded with terms related to its original terms. Besides, documents that contain both the query terms and terms associated with them are ranked above documents that contain only the query terms.

5.1. Algorithm

The implementation of the model is divided into two phases. The first is the generation of the information on the dependence among the terms, which means the construction of the vectors $\vec{k}'_i$. This task is carried out entirely in the pre-processing phase. The second phase is the implementation of the search procedure of the proposed model.

The search algorithm used in the implementation of the vector space model modified by dependence among the terms, described in Figure 1, is similar to that of the original model. It uses a list of accumulators A, where each item $A_j$ of A stores the partial similarity of the document $d_j$ in relation to the query q. The function value(k'_i, i) returns the value stored in position i of the vector of the term $k_i$. The modifications to the original algorithm needed to reflect the dependence among the terms are in step (2) and in the loop of step (6).

(1) Create and initialize a structure of accumulators (A)
(2) For each query term k_i, add to the query all the terms associated with it.
(3) For each term k_i of the modified query do:
(4)     For each pair [d_j, f_ij] in the inverted list of the term do:
(5)         aux = w_ij * w_iq * (value(k'_i, i))^2
(6)         For each term k_j associated with term k_i do:
(7)             aux = aux + (w_ij * w_iq * value(k'_i, i) * value(k'_i, j))
(8)         End For
(9)         If A_j is not in A then
(10)            A_j = aux
(11)        else
(12)            A_j = A_j + aux
(13)        A = A + {A_j}
(14)    End For
(15) End For
(16) Divide each accumulator A_j by the document norm |d_j|.
(17) Order the list of accumulators A_j and return the documents d_j retrieved.

Figure 1. Search algorithm for the vector space model modified by dependence among the terms.

Step (2) differs from the original algorithm. Once the identifiers of the query terms have been determined, the terms associated with each query term are added to the list of query terms. This step of the algorithm defines the automatic expansion of the query with the terms related to the query terms.
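The effect of the rotated basis on the ranking can be checked with a small dense-vector sketch of the similarity function of the modified model. Note that this is not the accumulator algorithm of Figure 1, which works over inverted lists; it evaluates the inner product in the rotated basis directly and normalizes by the original norms. All names, weights and the three-term vocabulary are assumptions made for the example.

```python
import math


def to_space(weights, basis):
    """Linear combination sum_i w_i * k'_i, where basis[i] is k'_i."""
    t = len(basis)
    out = [0.0] * t
    for w, k in zip(weights, basis):
        if w:
            for r in range(t):
                out[r] += w * k[r]
    return out


def modified_similarity(doc_w, query_w, basis):
    """Inner product in the rotated basis over the ORIGINAL norms."""
    d = to_space(doc_w, basis)
    q = to_space(query_w, basis)
    dot = sum(a * b for a, b in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in doc_w))
    norm_q = math.sqrt(sum(w * w for w in query_w))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_d * norm_q)


if __name__ == "__main__":
    # Hypothetical vocabulary: 0 = tourism, 1 = hotel, 2 = price.
    # Rule tourism -> hotel with confidence 0.8 rotates k_0 to 18 degrees.
    theta = math.radians(90.0 * (1.0 - 0.8))
    rotated = [[math.sin(theta), math.cos(theta), 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]]
    query = [1.0, 0.0, 0.0]  # query: "tourism"
    for name, doc in [("hotel only", [0.0, 1.0, 0.0]),
                      ("tourism only", [1.0, 0.0, 0.0]),
                      ("tourism and hotel", [1.0, 1.0, 0.0])]:
        print(name, round(modified_similarity(doc, query, rotated), 4))
```

In this toy setting, the document containing only "hotel" now matches the query "tourism", and the document containing both terms scores above the one containing only "tourism", in line with the query expansion effect described above. Because the original norms are used in the denominator, the score is no longer bounded by 1. With the canonical basis, the function reduces to the usual cosine.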

Steps (5) to (8) compute the sum $\sum_{i,s=1}^{t} w_{i,j}\,\vec{k}'_i \cdot w_{s,q}\,\vec{k}'_s$ from the equation of the inner product between the vectors $\vec{d}'_j$ and $\vec{q}'$. Step (5) corresponds to the terms of the sum with $i = s$, and the loop of step (6) corresponds to the other cases, when $i \neq s$. These steps are necessary because the term vectors are non-orthogonal.

Analysis of the algorithm shows that the proposed model is an extension of the original vector space model: if no association among the terms exists, the algorithm described is equivalent to the original algorithm.

6. Experiments

To evaluate the effectiveness of the vector space model modified by dependence among the terms, experiments were run on four reference collections: CACM [8], Cystic Fibrosis (CFC) [19], CISI and the Third Text Retrieval Conference (TREC-3) [10]. The characteristics of the collections are shown in Table 1.

Table 1. Characteristics of the reference collections.

Characteristic                            CFC      CACM     CISI     TREC-3
Number of distinct terms                  2105     8716     9728     1749555
Number of documents                       1239     1602     1460     741855
Average number of terms per document      12.2     46.6     53.6     301.1
Number of queries                         64       50       50       50
Average number of terms per query         4.0      12.7     9.4      18.58
Average relevant documents per query      39       13       50       106.38

The evaluation of the IR system proposed here concerns the effectiveness of the retrieval, in other words, how precise the answer set returned by the system is for a given query. We used precision-recall curves to compare the effectiveness of the vector space model modified by dependence among terms with that of the classic vector space model. Each curve quantifies the precision as a function of the percentage of relevant documents retrieved (recall).

Some parameters can be adjusted during the generation of the association rules. Min_sup and min_conf are, respectively, the support and confidence thresholds.
We ran experiments and observed that min_sup should be set to a low value (up to 5%) because, in general, the frequency of terms in the collections is low. If min_sup is set too high, association rules involving terms whose frequency in the document collection is small are discarded. On the other hand, min_conf should be set to a higher value (above 40%), because this parameter determines how closely the vectors are brought together. If min_conf is set to a low value, term vectors with very low co-occurrence are brought close together. This harms the effectiveness of the retrieval, because the system will expand the query with terms that are not related to the query terms.

As we can see in Figures 2 and 3, the proposed model yields better precision than the vector space model, regardless of the collection and of the recall level. Table 2 summarizes the results obtained: it shows the average precision of the two models for all collections and the gains of the proposed model relative to the original.

[Figure 2: two panels, "Recall x Precision CACM" and "Recall x Precision CISI", plotting precision (0% to 80%) against recall (0% to 100%) for the VSM and MVSM curves.]

Figure 2. Recall-Precision for CACM and CISI.

[Figure 3: two panels, "Recall x Precision CFC" and "Recall x Precision TREC-3", plotting precision (0% to 80%) against recall (0% to 100%) for the VSM and MVSM curves.]

Figure 3. Recall-Precision for CFC and TREC-3.

Table 2. Average precision and gain provided by the vector space model modified by association rules.

Collection    Average precision (%)      Gain (%)
              Classic      Modified
CACM          30.03        32.08         6.83
CISI          17.64        20.09         13.89
CFC           10.05        13.24         31.74
TREC-3        12.09        14.04         16.13

The results presented for the vector space model modified by association rules are the best ones obtained over the parameter values analyzed. For min_sup values from 4% to 5% and min_conf values between 45% and 70%, the results vary minimally from those presented. When the minimum confidence is set above 70%, few rules are generated and, consequently, the results approach those of the classic vector space model. Various other parameter values were tested, and the collections behave in a similar way as the parameters change.

The experiments have shown that the proposed model improves the average precision of the answer set for all collections. Besides, the average precision obtained was not harmed by the increase in recall that occurs when the queries are expanded.

7. Conclusions

In this paper, we have presented an extension to the vector space model that reflects the dependence among the terms of the collection. In the proposed model, the dependence among the terms is represented geometrically in the vector space.

The proposed model is based on the rotation of the term vectors in accordance with the dependence among the terms. This rotation relies on techniques that generate information on the correlation among the terms of the collection. In this work, we used association rules. However, other techniques can be used.

The generation of association rules is a well-known data mining technique that finds frequent patterns in large databases. In the context of this paper, it is used to find sets of terms that appear together in the documents of the collection. This information is used to modify the term vectors so that they reflect the co-occurrence semantics defined by the association rules.

The extension to the vector space model presented here contemplates the dependence among terms in a clear, flexible and new way. It is clear because the dependence among the terms is incorporated step by step and the vector space basis reflects the semantics defined by the adopted technique. It is flexible because it allows the correlation among the terms of the collection to be obtained in several ways. Finally, the proposal is new because, to our knowledge, the literature contains no extension to the vector space model that modifies the vector space basis as was done in this work.
We have evaluated the effectiveness of the proposed model with four reference collections. Retrieval effectiveness increased in comparison with the classic vector space model for all of the reference collections used.

As future work, the effectiveness of the proposed model will be compared to that of the generalized vector space model. Besides, we will investigate other methods of obtaining the correlation among the terms of a collection of documents. These methods will be incorporated geometrically into the model proposed in this paper. We also intend to evaluate the proposed model on larger collections formed by Web documents.

References

1. Adriaans, P., Zantinge, D. Data Mining. England, Addison-Wesley, 1996.
2. Agrawal, R., Imielinski, T., Swami, A. Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD Conference. Washington, DC, USA, May 1993, p. 207-216.
3. Agrawal, R., Srikant, R. Fast algorithms for mining association rules. Proceedings of the 20th Int'l Conference on Very Large Databases. Santiago, Chile, September 1994.
4. Baeza-Yates, R., Ribeiro-Neto, B. Modern information retrieval. ACM/Addison-Wesley, 1999.
5. Becker, J., Kuropka, D. Topic-based vector space model. Proceedings of the 6th International Conference on Business Information Systems, Colorado Springs, June 2003, p. 7-12.
6. Bollmann-Sdorra, P., Raghavan, V. V. On the necessity of term dependence in a query space for weighted retrieval. Journal of the American Society of Information Science, 49(13): 1161-1168, 1998.
7. Buckley, C., Salton, G., Allan, J., Singhal, A. Automatic query expansion using SMART: TREC 3. In D. K. Harman, editor, NIST Special Publication 500-225: The Third Text Retrieval Conference (TREC 3), 1995, p. 69-80.
8. CACM Collection. ftp://ftp.cs.cornell.edu/pub/smart/cacm.
9. Han, J., Kamber, M. Data mining: Concepts and techniques. San Diego: Academic Press, 2001, p. 335-393.
10. Harman, D. Overview of the third Text Retrieval Conference. Proceedings of the third Text Retrieval Conference (TREC-3), Gaithersburg, MD, USA, 1995, p. 1-20.
11. Mandala, R., Tokunaga, T., Tanaka, H. M. Combining multiple evidence from different types of thesaurus for query expansion. Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, Berkeley, California, United States, August 1999, p. 191-197.
12. Nie, J. Y., Jin, F. Integrating logical operators in query expansion in Vector Space Model. Workshop on Mathematical/Formal Methods in Information Retrieval, 25th ACM-SIGIR, Tampere, Finland, August 2002.
13. Pôssas, B., Ziviani, N., Meira-Jr, W. Enhancing the set-based model using proximity information. Proceedings of the 9th International Symposium of String Processing and Information Retrieval, Lisbon, Portugal, September 2002, p. 104-116.
14. Pôssas, B., Ziviani, N., Meira-Jr, W., Ribeiro-Neto, B. Modelagem vetorial estendida por regras de associação. XVI Simpósio Brasileiro de Banco de Dados, Rio de Janeiro, Brasil, 2001.
15. Pôssas, B., Ziviani, N., Meira-Jr, W., Ribeiro-Neto, B. Set-based model: A new approach for information retrieval. Proceedings of the 25th International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, August 2002.
16. Salton, G. (ed) The SMART retrieval system: experiments in automatic document processing. Englewood Cliffs, NJ: Prentice Hall, 1971.
17. Salton, G., Lesk, M. E. Computer evaluation of indexing and text processing. Journal of the ACM, 15(1):8-36, January 1968.
18. Salton, G., McGill, M. J. Introduction to modern information retrieval. McGraw-Hill, New York, 1983.
19. Shaw, W. M., Wood, R. E., Tibbo, H. R. The cystic fibrosis database: Content and research opportunities. Library and Information Science Research, 13:347-366, 1991.
20. Voorhees, E. M. Query expansion using lexical-semantic relations. Proceedings of the 17th ACM-SIGIR Conference, 1993, p. 171-180.
21. Wong, S. K. M., Raghavan, V. V. The vector space model of information retrieval: A reevaluation. Proceedings of the 7th annual international ACM SIGIR conference on Research and development in information retrieval, Cambridge, England, 1984.
22. Wong, S. K. M., Ziarko, W., Raghavan, V. V., Wong, P. C. N. On modeling of information retrieval concepts in vector spaces. ACM Transactions on Database Systems, Volume 12, New York, NY, USA, June 1987, p. 299-321.
23. Wong, S. K. M., Ziarko, W., Wong, P. C. N. Generalized vector space model in information retrieval. Proceedings of the 8th ACM-SIGIR Conference on Research and Development in Information Retrieval. New York, USA, 1985, p. 18-25.