Dependence among Terms in Vector Space Model


Ilmério Reis Silva, João Nunes de Souza, Karina Silveira Santos
Faculdade de Computação - Universidade Federal de Uberlândia (UFU)
[ilmerio, Nunes]@facom.ufu.br, karinass@pop.com.br

Abstract

The vector space model is a mathematically based model that represents terms, documents and queries as vectors and provides a ranking. In this model, the subspace of interest is formed by a set of pairwise orthogonal term vectors, which indicates that the terms are mutually independent. This, however, is an oversimplification. With this in view, we present an extension to the vector space model that takes the correlation among terms into account. In the proposed model, the term vectors, originally orthogonal, are rotated in space so that they geometrically reflect the dependence semantics among terms. This rotation can be driven by any technique that generates information on the relationship among the terms of the collection. We propose using association rules to find sets of terms that co-occur in the document collection. The retrieval effectiveness of the proposed model is evaluated, and the results show that it improves average precision relative to the standard vector model for all collections evaluated, with gains of up to 31%.

1. Introduction

In information retrieval (IR), the vector space model is the most popular model [16,17,18]. Its definition of term weights in documents and its partial matching of queries against documents result in a good ranking strategy. It is also simple and fast [4,7].

Although it is one of the most widely used IR models, the vector space model has disadvantages [6,21]. Documents are represented by keywords extracted from the documents themselves, and the relationship among those keywords is not considered. This means, for instance, that the context in which the terms appear is not represented. This is an oversimplification of the model.
Many words have multiple meanings, and the terms of a query can literally match the terms of an irrelevant document. Considering this, the main objective of this work is to incorporate information about the correlation among the terms of the collection into the vector space model in order to improve its retrieval effectiveness.

The proposed solution alters the representation of term vectors in the vector space model. In the original model, terms are represented by orthogonal vectors, since no correlation among the terms is known a priori. The algorithm proposed in this work has as its main foundation the rotation of those vectors in the space, so that their representations reflect the dependence among the terms. Every term vector that has some correlation with one or more other terms is rotated. After all rotations, the term vectors are not necessarily orthogonal to one another. In the resulting set of vectors, the proximity between two vectors is related to the degree of dependence between the respective terms: the closer the term vectors, the greater the dependence observed between them.

The rotation of term vectors is based on techniques that produce information on the relationship among the terms of the collection. We use the association rules technique from data mining to obtain this information.

The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 describes the foundations of the vector space model. Section 4 presents association rules in the context of information retrieval. The proposed model is described in Section 5. The experimental results are discussed in Section 6. Finally, Section 7 presents conclusions and future work.

2. Related Work

Several approaches for incorporating correlation among terms have already been presented in the literature. We describe the works related to this paper.

Query expansion in the vector space model is suggested in several proposals, among them [7,11,12,20]. In [20], Voorhees examined the usefulness of lexical query expansion on the TREC collection and obtained considerable improvements in effectiveness only for short queries. Mandala et al. [11] analyzed the characteristics of different thesaurus types and proposed a method to combine them for query expansion. In [12], Nie and Jin used the logical operator OR to connect expansion terms with the original terms of the query.

In [5], Becker and Kuropka present an IR model for document comparison that represents topics, terms and documents as vectors. The basis of the space is formed by a set of orthogonal topic vectors, in which the term vectors are represented. The angle between term vectors and the weight of each term are calculated using information about the collection, such as a list of stems of the collection terms.

Work similar to that proposed here was carried out by Pôssas et al. in [13,14,15]. They suggested an extension to the vector space model that considers the correlation among terms, obtained using association rules. In [15], a new model named the set-based model is presented for computing term weights, based on set theory, and for ranking documents.
The theory of association rules is used to compute those weights. The proposal presented by the authors in [14] is similar to [15]; the main difference lies in how the association rules are used. Then, in [13], an extension to the set-based model is proposed that uses information about the proximity of the query terms within the documents.

The generalized vector space model (GVSM) is another extension of the vector space model that takes the correlation among terms into account [22,23]. In GVSM, the term vectors can be non-orthogonal and are represented by smaller components named minterms. The minterms are vectors with binary weights that indicate all possible co-occurrence patterns of terms in documents. The basis for GVSM is formed by a set of $2^t$ minterm vectors, where $t$ is the number of distinct terms in the collection. The term vectors are linear combinations of minterms, reflecting the co-occurrences captured by the minterms.

Our work differs from the related work above in the following aspects. In none of the cited works are the term vectors rotated in the space to reflect their correlation, as we do in this work. Moreover, association rules are used here to determine the proximity among term vectors, which differs from the cited models.

3. Vector Space Model

The vector space model was initially proposed by Gerard Salton [16,17]. In this model, all objects relevant to an information retrieval system are represented as vectors: terms, documents and queries.

Each term $k_i$ is represented as a $t$-dimensional vector $\vec{k}_i$, where $t$ is the number of distinct terms in the collection. If $a_r$ is the $r$-th element of the vector $\vec{k}_i$, then $\vec{k}_i = (a_1, a_2, \ldots, a_t)$, where

$$a_r = \begin{cases} 0 & \text{if } r \neq i \\ 1 & \text{if } r = i \end{cases}$$

that is,

$$\begin{aligned}
\vec{k}_1 &= (1, 0, 0, \ldots, 0) \\
\vec{k}_2 &= (0, 1, 0, \ldots, 0) \\
&\ \ \vdots \\
\vec{k}_t &= (0, 0, 0, \ldots, 1)
\end{aligned}$$

The set of all term vectors $K = \{\vec{k}_1, \vec{k}_2, \ldots, \vec{k}_t\}$ is linearly independent and forms the canonical basis of the space $\mathbb{R}^t$. The term vectors are pairwise orthogonal and, in consequence, the corresponding terms are considered independent.

Document and query vectors are represented using the set $K$ of term vectors. These vectors are built as linear combinations of the term vectors. The vector $\vec{d}_j$ associated with the document $d_j$ is defined as:

$$\vec{d}_j = \sum_{i=1}^{t} w_{i,j}\,\vec{k}_i \quad \text{or} \quad \vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$$

Similarly, the vector for the query $q$ is defined as:

$$\vec{q} = \sum_{i=1}^{t} w_{i,q}\,\vec{k}_i \quad \text{or} \quad \vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$$

In the equalities above, $w_{i,j}$ and $w_{i,q}$ are the weights of term $i$ in document $j$ and in query $q$, respectively. The best-known definition of term weights for information retrieval is tf-idf [4]. This strategy considers the number of times an index term occurs in a document and the number of documents of the collection in which the index term occurs.

The vector space model evaluates the degree of similarity of the document $d_j$ in relation to the query $q$ as the correlation between the vectors $\vec{d}_j$ and $\vec{q}$. The smaller the angle between the two vectors, the more relevant the document is considered for the query. Usually, that correlation is quantified by the cosine of the angle between the two vectors. That is,

$$sim(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2}\ \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$
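The cosine ranking above can be illustrated with a minimal Python sketch. The function names, the dictionary-based sparse representation and the toy weights are assumptions made for this illustration only; they are not taken from the paper.

```python
import math


def cosine_similarity(doc_weights, query_weights):
    """Cosine of the angle between a document vector and a query vector.

    Both arguments map a term to its weight (for example tf-idf);
    a term missing from a mapping has weight 0.
    """
    dot = sum(w * query_weights.get(term, 0.0) for term, w in doc_weights.items())
    doc_norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    query_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    if doc_norm == 0.0 or query_norm == 0.0:
        return 0.0
    return dot / (doc_norm * query_norm)


def rank(documents, query_weights):
    """Return (similarity, document id) pairs, most similar first."""
    scored = [(cosine_similarity(w, query_weights), doc_id)
              for doc_id, w in documents.items()]
    return sorted(scored, reverse=True)


# Toy collection with hypothetical weights.
docs = {
    "d1": {"tourism": 1.0, "hotel": 1.0},
    "d2": {"hotel": 2.0},
    "d3": {"bank": 1.0},
}
query = {"tourism": 1.0}
ranking = rank(docs, query)
```

In this toy collection only `d1` shares a term with the query, so `d2` scores zero even though it discusses hotels. This is the kind of literal matching that the term-dependence extension is meant to relax.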

The documents closest to the query in the space are considered relevant for the user and are returned as the answer set for the query. After the similarity degrees are computed, the documents can be ordered in a list (ranking) together with their respective degrees of relevance to the query.

4. Association Rules in Information Retrieval

In data mining, association rules typically serve to represent frequent patterns found in the data [1,2,3,9]. The main function of the rules is to characterize the data by representing regularities. One of the purposes of this work is to apply data mining in IR.

In general, the data mining literature works with items and transactions. However, the algorithms used for the discovery of association rules can also be adapted to work with terms and documents, identifying the co-occurrence among terms. An association rule has the form $X \Rightarrow Y$; in the IR context, $X$ and $Y$ are terms or sets of terms. Consider the following example, which defines an association rule in IR. The information that documents whose theme is tourism also discuss hotels is represented by association rule (1) below:

$$\text{tourism} \Rightarrow \text{hotel} \quad [\text{support} = 2\%,\ \text{confidence} = 80\%] \qquad (1)$$

The support and the confidence of a rule are two measures that reflect, respectively, the usefulness and the certainty of the rules found. The support is a percentage relative to the entire collection of documents analyzed. In the example above, the words tourism and hotel appear together in the same document in 2% of the collection. The confidence is a percentage relative to the antecedent of the rule. A confidence of 80% means that 80% of the documents that discuss tourism also discuss hotels. Typically, association rules are considered useful if they meet a support threshold and a confidence threshold [12].

4.1. Basic Concepts

Let $J = \{k_1, k_2, \ldots, k_m\}$ be the set of distinct terms in a collection of documents $D$. Each document $d_j$ of the database is a set of terms such that $d_j \subseteq J$.
An association rule is an implication of the form $A \Rightarrow B$, where $A \subset J$, $B \subset J$, and $A \cap B = \emptyset$. The rule $A \Rightarrow B$ holds in a set of documents $D$ with support $s$ if $s$ is the percentage of documents in $D$ that contain $A \cup B$ (in other words, both $A$ and $B$). The rule $A \Rightarrow B$ has confidence $c$ in the set of documents $D$ if $c$ is the percentage of documents in $D$ containing $A$ that also contain $B$. Rules that meet a minimum support (min_sup) and a minimum confidence (min_conf) are termed strong.

A set of terms is referred to as a termset. A termset that contains $k$ terms is a $k$-termset. Association rules are found in large databases in two steps:

1. Find all the sets of terms (termsets) that meet the minimum support. These termsets are named frequent termsets.
2. Generate strong association rules from the frequent termsets: by definition, these rules must meet the minimum support and the minimum confidence.

Apriori is an algorithm for mining frequent termsets for association rules [2,3,12]. Apriori uses an iterative, level-wise search in which $k$-termsets are used to explore $(k+1)$-termsets. First, the set of frequent 1-termsets is found. This set is denoted $L_1$. $L_1$ is used to find $L_2$, the set of frequent 2-termsets, which is used to find $L_3$, and so on, until no more frequent $k$-termsets can be found. Finding each $L_k$ requires a complete scan of the database. To improve the efficiency of generating frequent termsets, an important property, called the Apriori property, is used to reduce the search space: every nonempty subset of a frequent termset must also be frequent.

Once the frequent termsets of the transactions in the database $D$ have been generated, the strong association rules can be generated. This can be done using the following equation for the confidence, based on termset frequencies:

$$confidence(A \Rightarrow B) = P(B \mid A) = \frac{freq(A \cup B)}{freq(A)}$$

where $freq(A \cup B)$ is the number of transactions containing the termset $A \cup B$, and $freq(A)$ is the number of transactions containing $A$. Based on this equation, association rules can be generated as follows:

- For each frequent termset $l$, generate all nonempty subsets of $l$.
- For each nonempty subset $s$ of $l$, generate the rule $s \Rightarrow (l - s)$ if $\frac{freq(l)}{freq(s)} \geq min\_conf$, where min_conf is the minimum confidence threshold.

In the following section, we show how association rules modify the vector space model.

5. Vector Space Model Modified by Association Rules

The main foundation of the algorithm proposed in this work is the rotation of the term vectors in the space, so that their representations geometrically reflect the adopted semantics of correlation among terms. We use association rules as the tool for generating information about the dependence among the terms. The term vectors are rotated in the space so that they reflect, geometrically, the semantics defined by the association rules. This method is based on the assumption that a pair of words that frequently occur together in the same documents is related to the same subject.
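The two mining steps of Section 4 (frequent termsets first, then strong rules) can be sketched in Python for rules between single terms, which is the rule form used by the model. This is a simplified illustration restricted to 1- and 2-termsets, not a full Apriori implementation; the function name `mine_pair_rules` and the toy documents are assumptions of this sketch.

```python
from itertools import combinations


def mine_pair_rules(documents, min_sup, min_conf):
    """Return {(antecedent, consequent): confidence} for strong rules
    between single terms. `documents` is a list of sets of terms."""
    n_docs = len(documents)

    # Level 1: count every term and keep the frequent 1-termsets (L1).
    freq1 = {}
    for doc in documents:
        for term in doc:
            freq1[term] = freq1.get(term, 0) + 1
    frequent_terms = {t for t, f in freq1.items() if f / n_docs >= min_sup}

    # Level 2: by the Apriori property, candidate pairs only need to
    # combine terms that are themselves frequent.
    freq2 = {}
    for doc in documents:
        for pair in combinations(sorted(doc & frequent_terms), 2):
            freq2[pair] = freq2.get(pair, 0) + 1

    # Rule generation: confidence(a => b) = freq({a, b}) / freq({a}).
    rules = {}
    for (a, b), f in freq2.items():
        if f / n_docs < min_sup:
            continue
        for antecedent, consequent in ((a, b), (b, a)):
            confidence = f / freq1[antecedent]
            if confidence >= min_conf:
                rules[(antecedent, consequent)] = confidence
    return rules


docs = [
    {"tourism", "hotel"},
    {"tourism", "hotel", "beach"},
    {"hotel", "bank"},
    {"tourism", "hotel"},
    {"bank"},
]
rules = mine_pair_rules(docs, min_sup=0.4, min_conf=0.8)
```

With these toy documents, the rule tourism ⇒ hotel has confidence 3/3 and is kept, while hotel ⇒ tourism has confidence 3/4 and falls below the 0.8 threshold. The asymmetry is why the direction of a rule matters when deciding which term vector to rotate.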
The association rules are of the form $k_i \Rightarrow k_j$, and $c_{ij}$ is the confidence of the rule, which indicates the degree of dependence of the term $k_i$ on the term $k_j$. That value is used, in this work, to compute the new angle between the term vectors $\vec{k}_i$ and $\vec{k}_j$. The confidence was chosen as the parameter that determines the proximity of the term vectors because it reflects the certainty of the association rule. The term vectors are brought closer together according to the association rules generated for the respective terms, as follows.

Definition 5.1 (Rotation of basis vectors): Let $\vec{k}_i$ and $\vec{k}_j$ be two term vectors and $c_{ij}$ the confidence of the association rule $k_i \Rightarrow k_j$. The new angle $\theta_{ij}$ between $\vec{k}_i$ and $\vec{k}_j$ is given by

$$\theta_{ij} = 90^\circ\,(1 - c_{ij})$$

where $90^\circ$ is the original angle between the vectors $\vec{k}_i$ and $\vec{k}_j$.

In this case, the rotation occurs only in the vector $\vec{k}_i$; the vector $\vec{k}_j$ is not modified. The reason for this is related to the semantics of the association rule and of the confidence. The confidence $c_{ij}$ of the association rule $k_i \Rightarrow k_j$ states that, in $c_{ij}$ percent of the cases in which the term $k_i$ appears, the term $k_j$ also appears. Therefore, the rotation is applied to the vector corresponding to the antecedent term of the association rule.

$\theta_{ij}$ is the new angle between the term vectors $\vec{k}_i$ and $\vec{k}_j$ whenever $\theta_{ij} < 90^\circ$. The vector $\vec{k}_i$ approaches the vector $\vec{k}_j$, and the new vector is named $\vec{k}'_i$, where the $r$-th element of $\vec{k}'_i$, named $a'_r$, is defined as:

$$a'_r = \begin{cases} \sin(\theta_{ij}) & \text{if } r = i \\ \cos(\theta_{ij}) & \text{if } r = j \\ 0 & \text{if } r \neq i \text{ and } r \neq j \end{cases}$$

Therefore, the vector $\vec{k}_i$ is transformed into the vector $\vec{k}'_i = (a'_1, a'_2, \ldots, a'_t)$ by altering positions $i$ and $j$ of the original vector. In position $i$ we have $\sin(\theta_{ij})$ and in position $j$ we have $\cos(\theta_{ij})$.

If a term $k_p$ has two or more associated terms, the new vector $\vec{k}'_p$ is normalized as follows. Let $k_p \Rightarrow k_n$ and $k_p \Rightarrow k_v$ be two association rules with the same antecedent and respective confidences $c_{pn}$ and $c_{pv}$. The new vector $\vec{k}'_p$ is defined as

$$\vec{k}'_p = \frac{\vec{k}'_{pn} + \vec{k}'_{pv}}{|\vec{k}'_{pn} + \vec{k}'_{pv}|}$$

where $\vec{k}'_{pn}$ is the vector $\vec{k}_p$ modified using $k_p \Rightarrow k_n$ and $c_{pn}$ (Definition 5.1), and $\vec{k}'_{pv}$ is the vector modified using $k_p \Rightarrow k_v$ and $c_{pv}$.

The vector space basis $K$ is formed by the set of term vectors $\{\vec{k}_1, \vec{k}_2, \ldots, \vec{k}_t\}$. After the rotation of the term vectors, the new basis for the vector space, denoted $K'$, is obtained from $K$ by replacing the vectors $\vec{k}_i$ with $\vec{k}'_i$, so $K' = \{\vec{k}'_1, \vec{k}'_2, \ldots, \vec{k}'_t\}$. The set $K'$ still forms a basis of the vector space $\mathbb{R}^t$ because its vectors are linearly independent.

The document and query vectors are represented in the new basis $K'$ as linear combinations of the term vectors $\vec{k}'_i$. They are denoted $\vec{d}'_j$ and $\vec{q}'$ and defined as:

$$\vec{d}'_j = \sum_{i=1}^{t} w_{i,j}\,\vec{k}'_i \qquad \vec{q}' = \sum_{i=1}^{t} w_{i,q}\,\vec{k}'_i$$

So the document and query vectors $\vec{d}'_j$ and $\vec{q}'$ now reflect the dependence semantics among the terms that is implicit in the basis $K'$.
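Definition 5.1 and the normalization for a term with several rules can be sketched in Python as follows. The function names and the example confidences are assumptions of this illustration; term positions are 0-based here, unlike the 1-based indices in the text.

```python
import math


def rotate_term_vector(i, j, confidence, t):
    """Rotated vector k'_i for the rule k_i => k_j (Definition 5.1).

    Returns a list of length t with sin(theta) at position i,
    cos(theta) at position j and 0 elsewhere,
    where theta = 90 * (1 - confidence) degrees.
    """
    theta = math.radians(90.0 * (1.0 - confidence))
    vector = [0.0] * t
    vector[i] = math.sin(theta)
    vector[j] = math.cos(theta)
    return vector


def combine_rotations(rotated_vectors):
    """Normalized sum of several rotated versions of the same term vector."""
    summed = [sum(components) for components in zip(*rotated_vectors)]
    norm = math.sqrt(sum(x * x for x in summed))
    return [x / norm for x in summed]


# Term 0 has two rules: 0 => 1 with confidence 0.8, 0 => 2 with confidence 0.5.
k0_toward_1 = rotate_term_vector(0, 1, 0.8, t=3)
k0_toward_2 = rotate_term_vector(0, 2, 0.5, t=3)
k0_new = combine_rotations([k0_toward_1, k0_toward_2])

# Angle, in degrees, between the rotated vector and the unchanged k_1.
angle_to_k1 = math.degrees(math.acos(k0_toward_1[1]))
```

A confidence of 0.8 leaves an angle of 18 degrees between the rotated antecedent vector and the consequent vector, while a confidence of 0 would leave the original 90 degrees untouched.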
The vector space model modified by dependence among the terms uses the same similarity function as the original model. Therefore, we have

$$sim(d'_j, q') = \frac{\vec{d}'_j \cdot \vec{q}'}{|\vec{d}_j| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j}\,\vec{k}'_i \cdot \sum_{s=1}^{t} w_{s,q}\,\vec{k}'_s}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2}\ \sqrt{\sum_{s=1}^{t} w_{s,q}^2}} = \frac{\sum_{i,s=1}^{t} w_{i,j}\,\vec{k}'_i \cdot w_{s,q}\,\vec{k}'_s}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2}\ \sqrt{\sum_{s=1}^{t} w_{s,q}^2}}$$

The similarity between the query and the documents changes because the respective vectors are now built from non-orthogonal term vectors. The normalization of the similarity, that is, the factors in the denominator of the formula, uses the original norm of the documents. That strategy was adopted because, if the normalization used the document vectors $\vec{d}'_j$, the norm of every document would have to be recalculated, raising the computational cost of calculating the similarity. Besides, that simplification does not change the results significantly.

In the computation of the similarity between the query and the documents, the main consequence of rotating the term vectors is automatic query expansion. The query is expanded with terms related to its original terms. Besides, documents that contain both query terms and terms associated with them are ranked above documents that contain only the query terms.

5.1. Algorithm

The implementation of the model is divided into two phases. The first is the generation of the information on the dependence among the terms, which means the construction of the vectors $\vec{k}'_i$. This task is accomplished entirely in the pre-processing phase. The second phase is the implementation of the search in the proposed model.

The search algorithm used in the implementation of the vector space model modified by dependence among the terms, described in Figure 1, is similar to that of the original model. It uses a list $A$ of accumulators, with each item $A_j$ of $A$ storing the partial similarity of the document $d_j$ in relation to the query $q$. The function value($k_i$, $i$) returns the value stored in position $i$ of the vector of the term $k_i$. The modifications to the original algorithm that are needed to reflect the dependence among the terms are in step (2) and in the loop of step (6).

(1)  Create and initialize a structure of accumulators (A)
(2)  For each query term k_i, add to the query all the terms associated.
(3)  For each term k_i of the modified query do:
(4)      For each pair [d_j, f_ij] in the term inverted list do:
(5)          aux = w_ij * w_iq * (value(k_i, i))^2
(6)          For each term k_j associated to term k_i do:
(7)              aux = aux + (w_ij * w_iq * value(k_i, i) * value(k_i, j))
(8)          End For
(9)          if A_j ∉ A then
(10)             A_j = aux
(11)         else
(12)             A_j = A_j + aux
(13)         A = A + {A_j}
(14)     End For
(15) End For
(16) Divide each accumulator A_j by the document norm |d_j|.
(17) Order the list of accumulators A_j and return the documents d_j retrieved.

Figure 1. Search algorithm for the vector space model modified by dependence among the terms.

Step (2) differs from the original algorithm. Once the identifiers of the query terms have been determined, the terms associated with each query term are added to the list of query terms. This step of the algorithm defines the automatic expansion of the query with the terms related to the query terms.
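As an illustration of the similarity formula for the modified model, the following Python sketch evaluates it directly on dense vectors. It is not the inverted-list algorithm of Figure 1; it only shows, under assumed toy weights, how a rotated basis lets a document match a query with which it shares no literal term. The function name and the two-term example are assumptions of this sketch.

```python
import math


def modified_similarity(doc_weights, query_weights, basis):
    """Inner product of d'_j and q' divided by the ORIGINAL norms.

    doc_weights, query_weights: lists of t term weights.
    basis: list of t term vectors k'_i, each a list of t floats.
    """
    t = len(basis)

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # d'_j = sum_i w_ij * k'_i and q' = sum_i w_iq * k'_i, component by component.
    d_vec = [sum(doc_weights[i] * basis[i][r] for i in range(t)) for r in range(t)]
    q_vec = [sum(query_weights[i] * basis[i][r] for i in range(t)) for r in range(t)]

    # The denominator uses the weights, i.e. the norms in the original basis.
    denom = math.sqrt(dot(doc_weights, doc_weights)) * math.sqrt(dot(query_weights, query_weights))
    return dot(d_vec, q_vec) / denom if denom else 0.0


# Terms: 0 = tourism, 1 = hotel. Rule tourism => hotel with confidence 0.8.
theta = math.radians(90.0 * (1.0 - 0.8))
identity = [[1.0, 0.0], [0.0, 1.0]]
rotated = [[math.sin(theta), math.cos(theta)], [0.0, 1.0]]

query = [1.0, 0.0]          # asks only for "tourism"
doc_hotel = [0.0, 1.0]      # mentions only "hotel"
doc_tourism = [1.0, 0.0]    # mentions only "tourism"
doc_both = [1.0, 1.0]       # mentions both terms

classic_hotel = modified_similarity(doc_hotel, query, identity)
sim_hotel = modified_similarity(doc_hotel, query, rotated)
sim_tourism = modified_similarity(doc_tourism, query, rotated)
sim_both = modified_similarity(doc_both, query, rotated)
```

With the identity basis the function reduces to the classic cosine, so the hotel-only document scores zero. With the rotated basis it receives a positive score, and the document containing both terms ranks above the one containing only the query term. Because the denominator keeps the original norms, the resulting scores are not bounded by 1.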

Steps (5) to (8) compute the sum $\sum_{i,s=1}^{t} w_{i,j}\,\vec{k}'_i \cdot w_{s,q}\,\vec{k}'_s$ in the equation of the inner product between the vectors $\vec{d}'_j$ and $\vec{q}'$. Step (5) corresponds to the terms of the sum with $i = s$, and the loop of step (6) corresponds to the other cases, when $i \neq s$. These steps are necessary because the term vectors are non-orthogonal.

The algorithm makes clear that the proposed model is an extension of the original vector space model: if no association among the terms exists, the algorithm described is equivalent to the original algorithm.

6. Experiments

To evaluate the effectiveness of the vector space model modified by dependence among the terms, experiments were carried out with four reference collections: CACM [8], Cystic Fibrosis (CFC) [19], CISI and the Third Text Retrieval Conference (TREC-3) [10]. The characteristics of the collections are shown in Table 1.

Table 1. Characteristics of the reference collections.

Columns: reference collection; number of distinct terms; number of documents; average number of terms per document; number of queries; average number of terms per query; average number of relevant documents per query.

CFC: ,2 64 4,0 39
CACM: , ,7 13
CISI: ,6 50 9,4 50
TREC: , ,58 106,38

The evaluation of the IR system proposed here concerns retrieval effectiveness, in other words, how precise the answer set returned by the system is for a given query. We used precision-recall curves to compare the effectiveness of the vector space model modified by dependence among terms with that of the classic vector space model. Each curve gives the precision as a function of the percentage of relevant documents retrieved (recall).

Some parameters can be adjusted during the generation of the association rules. Min_sup and min_conf are, respectively, the support and confidence thresholds.
Our experiments showed that min_sup should be set to a low value (up to 5%) because, in general, the frequency of terms in collections is low. If min_sup is high, association rules involving terms whose frequency in the document collection is small are discarded. On the other hand, min_conf should be set to a higher value (above 40%), because this parameter determines how closely the vectors are brought together. If min_conf is low, term vectors with very low co-occurrence are brought close together. This harms retrieval effectiveness, because the system expands the query with terms that are not related to the query terms.

As we can see in Figures 2 and 3, the proposed model yields better precision than the vector space model, regardless of the collection and of the recall level. Table 2 summarizes the results obtained, showing the average precision of the two models for all collections and the gains of the proposed model relative to the original.

Figure 2. Recall-Precision for CACM and CISI (curves: VSM and MVSM; axes: recall, 0% to 100%, versus precision).

Figure 3. Recall-Precision for CFC and TREC-3 (curves: VSM and MVSM; axes: recall, 0% to 100%, versus precision).

Table 2. Average precision and gain provided by the vector space model modified by association rules.

              Average precision (%)
Collection    Classic    Modified    Gain (%)
CACM          30.03      32.08        6.83
CISI          17.64      20.09       13.89
CFC           10.05      13.24       31.74
TREC-3        12.09      14.04       16.13

The results presented for the vector space model modified by association rules are the best ones obtained over the parameter values analyzed. For min_sup values of up to 4% to 5% and min_conf values between 45% and 70%, the results vary only minimally from those presented. When the minimum confidence is set above 70%, few rules are generated and, consequently, the results come closer to those of the classic vector space model. Various combinations of parameter values were tested, and the collections behaved similarly as the values changed.

The experiments have shown that the proposed model improves the average precision of the answer set for all collections. Besides, the average precision obtained was not harmed by the increase in recall that occurs when the queries are expanded.

7. Conclusions

In this paper, we have presented an extension to the vector space model that reflects the dependence among the terms of the collection. In the proposed model, the dependence among the terms is represented geometrically in the vector space.

The proposed model is based on the rotation of the term vectors according to the dependence among the terms. This rotation relies on techniques that generate information on the correlation among the terms of the collection. In this work we used association rules, but other techniques can be used.

The generation of association rules is a well-known data mining technique that finds frequent patterns in large databases. In the context of this paper, it is used to find sets of terms that appear together in the documents of the collection. This information is used to modify the term vectors so that they reflect the co-occurrence semantics defined by the association rules.

The extension to the vector space model presented here handles the dependence among terms in a clear, flexible and new way. It is clear because the dependence among the terms is incorporated step by step and the vector space basis reflects the semantics defined by the adopted technique. It is flexible because the correlation among the terms of the collection can be obtained in several ways. Finally, the proposal is new because the literature contains no extension to the vector space model that modifies the vector space basis as was done in this work.
We have evaluated the effectiveness of the proposed model with four reference collections. Retrieval effectiveness increased in comparison with the classic vector space model for all of the reference collections used.

As future work, the effectiveness of the proposed model will be compared with that of the generalized vector space model. Besides, we will research other methods of obtaining the correlation among the terms of a collection of documents. These methods will be incorporated geometrically into the model proposed in this paper. We also intend to evaluate the proposed model on larger collections formed by Web documents.

References

1. Adriaans, P., Zantige, D. Data Mining. England, Addison-Wesley,
2. Agrawal, R., Imielinski, T., Swami, A. Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD Conference. Washington, DC, USA, p , May
3. Agrawal, R., Srikant, R. Fast algorithms for mining association rules. Proceedings of the 20th Int'l Conference on Very Large Databases. Santiago, Chile, September
4. Baeza-Yates, R., Ribeiro-Neto, B. Modern information retrieval. ACM/Addison-Wesley,
5. Becker, J., Kuropka, D. Topic-based vector space model. Proceedings of the 6th International Conference on Business Information Systems, Colorado Springs, June 2003, p
6. Bollmann-Sdorra, P., Raghavan, V. V. On the necessity of term dependence in a query space for weighted retrieval. Journal of the American Society of Information Science, 49(13): ,
7. Buckley, C., Salton, G., Allan, J., Singhal, A. Automatic query expansion using SMART: TREC 3. In D. K. Harmon, editor, NIST Special Publication : The Third Text Retrieval Conference (TREC 3), 1995, p
8. CACM-Collection. ftp://ftp.cs.cornell.edu/pub/smart/cacm.
9. Han, J., Kamber, M. Data mining: Concepts and techniques. San Diego: Academic Press, 2001, p
10. Harman, D. Overview of the third Text Retrieval Conference. Proceedings of the third Text Retrieval Conference (TREC-3), Gaithersburg, MD, USA, 1995, p
11. Mandala, R., Tokunaga, T., Tanaka, H. M. Combining multiple evidence from different types of thesaurus for query expansion. Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, Berkeley, California, United States, August 1999, p
12. Nie, J. Y., Jin, F. Integrating logical operators in query expansion in Vector Space Model. Workshop on Mathematical/Formal Methods in Information Retrieval, 25th ACM-SIGIR, Tampere, Finland, August
13. Pôssas, B., Ziviani, N., Meira-Jr, W. Enhancing the set-based model using proximity information. Proceedings of the 9th International Symposium of String Processing and Information Retrieval, Lisbon, Portugal, September 2002, p
14. Pôssas, B., Ziviani, N., Meira-Jr, W., Ribeiro-Neto, B. Modelagem vetorial estendida por regras de associação. XVI Simpósio Brasileiro de Banco de Dados, Rio de Janeiro, Brasil,
15. Pôssas, B., Ziviani, N., Meira-Jr, W., Ribeiro-Neto, B. Set-based model: A new approach for information retrieval. Proceedings of the 25th International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, August
16. Salton, G. (ed.) The SMART retrieval system: Experiments in automatic document processing. Englewood Cliffs, NJ: Prentice Hall,
17. Salton, G., Lesk, M. E. Computer evaluation of indexing and text processing. Journal of the ACM, 15(1):8-36, January
18. Salton, G., McGill, M. J. Introduction to modern information retrieval. McGraw-Hill, New York,
19. Shaw, W. M., Wood, R. E, Tiboo, H. R. The cystic fibrosis database: Content and research opportunities. Library and Information Science Research, 13: ,
20. Voorhees, E. M. Query expansion using lexical-semantic relations. Proceedings of the 17th ACM-SIGIR Conference, 1993, p
21. Wong, S. K. M., Raghavan, V. V. The vector space model of information retrieval: A reevaluation. Proceedings of the 7th annual international ACM SIGIR conference on Research and development in information retrieval, Cambridge, England,
22. Wong, S. K. M., Ziarko, W., Raghavan, V. V., Wong, P. C. N. On modeling of information retrieval concepts in vector spaces. ACM Transactions on Database Systems, Volume 12, New York, NY, USA, June 1987, p
23. Wong, S. K. M., Ziarko, W., Wong, P. C. N. Generalized vector space model in information retrieval. Proceedings of the 8th ACM-SIGIR Conference on Research and Development in Information Retrieval. New York, USA, 1985, p


More information

Percent Perfect Performance (PPP)

Percent Perfect Performance (PPP) Percent Perfect Performance (PPP) Information Processing & Management, 43 (4), 2007, 1020-1029 Robert M. Losee CB#3360 University of North Carolina Chapel Hill, NC 27599-3360 email: losee at unc period

More information

International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Volume 1, Issue 2, July 2014.

International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Volume 1, Issue 2, July 2014. A B S T R A C T International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Information Retrieval Models and Searching Methodologies: Survey Balwinder Saini*,Vikram Singh,Satish

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval (Supplementary Material) Zhou Shuigeng March 23, 2007 Advanced Distributed Computing 1 Text Databases and IR Text databases (document databases) Large collections

More information

A Universal Model for XML Information Retrieval

A Universal Model for XML Information Retrieval A Universal Model for XML Information Retrieval Maria Izabel M. Azevedo 1, Lucas Pantuza Amorim 2, and Nívio Ziviani 3 1 Department of Computer Science, State University of Montes Claros, Montes Claros,

More information

A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH

A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH A thesis Submitted to the faculty of the graduate school of the University of Minnesota by Vamshi Krishna Thotempudi In partial fulfillment of the requirements

More information

A mining method for tracking changes in temporal association rules from an encoded database

A mining method for tracking changes in temporal association rules from an encoded database A mining method for tracking changes in temporal association rules from an encoded database Chelliah Balasubramanian *, Karuppaswamy Duraiswamy ** K.S.Rangasamy College of Technology, Tiruchengode, Tamil

More information

A New Approach for Automatic Thesaurus Construction and Query Expansion for Document Retrieval

A New Approach for Automatic Thesaurus Construction and Query Expansion for Document Retrieval Information and Management Sciences Volume 18, Number 4, pp. 299-315, 2007 A New Approach for Automatic Thesaurus Construction and Query Expansion for Document Retrieval Liang-Yu Chen National Taiwan University

More information

Discovering interesting rules from financial data

Discovering interesting rules from financial data Discovering interesting rules from financial data Przemysław Sołdacki Institute of Computer Science Warsaw University of Technology Ul. Andersa 13, 00-159 Warszawa Tel: +48 609129896 email: psoldack@ii.pw.edu.pl

More information

highest cosine coecient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate

highest cosine coecient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate Searching Information Servers Based on Customized Proles Technical Report USC-CS-96-636 Shih-Hao Li and Peter B. Danzig Computer Science Department University of Southern California Los Angeles, California

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

modern database systems lecture 4 : information retrieval

modern database systems lecture 4 : information retrieval modern database systems lecture 4 : information retrieval Aristides Gionis Michael Mathioudakis spring 2016 in perspective structured data relational data RDBMS MySQL semi-structured data data-graph representation

More information

ABSTRACT. VENKATESH, JAYASHREE. Pairwise Document Similarity using an Incremental Approach to TF-IDF. (Under the direction of Dr. Christopher Healey.

ABSTRACT. VENKATESH, JAYASHREE. Pairwise Document Similarity using an Incremental Approach to TF-IDF. (Under the direction of Dr. Christopher Healey. ABSTRACT VENKATESH, JAYASHREE. Pairwise Document Similarity using an Incremental Approach to TF-IDF. (Under the direction of Dr. Christopher Healey.) Advances in information and communication technologies

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 5 Relevance Feedback and Query Expansion Introduction A Framework for Feedback Methods Explicit Relevance Feedback Explicit Feedback Through Clicks Implicit Feedback

More information

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE Vandit Agarwal 1, Mandhani Kushal 2 and Preetham Kumar 3

More information

A Content Vector Model for Text Classification

A Content Vector Model for Text Classification A Content Vector Model for Text Classification Eric Jiang Abstract As a popular rank-reduced vector space approach, Latent Semantic Indexing (LSI) has been used in information retrieval and other applications.

More information

Using Query History to Prune Query Results

Using Query History to Prune Query Results Using Query History to Prune Query Results Daniel Waegel Ursinus College Department of Computer Science dawaegel@gmail.com April Kontostathis Ursinus College Department of Computer Science akontostathis@ursinus.edu

More information

Boolean Model. Hongning Wang

Boolean Model. Hongning Wang Boolean Model Hongning Wang CS@UVa Abstraction of search engine architecture Indexed corpus Crawler Ranking procedure Doc Analyzer Doc Representation Query Rep Feedback (Query) Evaluation User Indexer

More information

A Conflict-Based Confidence Measure for Associative Classification

A Conflict-Based Confidence Measure for Associative Classification A Conflict-Based Confidence Measure for Associative Classification Peerapon Vateekul and Mei-Ling Shyu Department of Electrical and Computer Engineering University of Miami Coral Gables, FL 33124, USA

More information

Association Rule Mining. Entscheidungsunterstützungssysteme

Association Rule Mining. Entscheidungsunterstützungssysteme Association Rule Mining Entscheidungsunterstützungssysteme Frequent Pattern Analysis Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree Virendra Kumar Shrivastava 1, Parveen Kumar 2, K. R. Pardasani 3 1 Department of Computer Science & Engineering, Singhania

More information

Mining Frequent Patterns with Counting Inference at Multiple Levels

Mining Frequent Patterns with Counting Inference at Multiple Levels International Journal of Computer Applications (097 7) Volume 3 No.10, July 010 Mining Frequent Patterns with Counting Inference at Multiple Levels Mittar Vishav Deptt. Of IT M.M.University, Mullana Ruchika

More information

Data Mining Part 3. Associations Rules

Data Mining Part 3. Associations Rules Data Mining Part 3. Associations Rules 3.2 Efficient Frequent Itemset Mining Methods Fall 2009 Instructor: Dr. Masoud Yaghini Outline Apriori Algorithm Generating Association Rules from Frequent Itemsets

More information

Discovery of Multi Dimensional Quantitative Closed Association Rules by Attributes Range Method

Discovery of Multi Dimensional Quantitative Closed Association Rules by Attributes Range Method Discovery of Multi Dimensional Quantitative Closed Association Rules by Attributes Range Method Preetham Kumar, Ananthanarayana V S Abstract In this paper we propose a novel algorithm for discovering multi

More information

Improved Frequent Pattern Mining Algorithm with Indexing

Improved Frequent Pattern Mining Algorithm with Indexing IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 16, Issue 6, Ver. VII (Nov Dec. 2014), PP 73-78 Improved Frequent Pattern Mining Algorithm with Indexing Prof.

More information

CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL

CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL STUDIA UNIV. BABEŞ BOLYAI, INFORMATICA, Volume LVII, Number 4, 2012 CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL IOAN BADARINZA AND ADRIAN STERCA Abstract. In this paper

More information

Association Rule Mining from XML Data

Association Rule Mining from XML Data 144 Conference on Data Mining DMIN'06 Association Rule Mining from XML Data Qin Ding and Gnanasekaran Sundarraj Computer Science Program The Pennsylvania State University at Harrisburg Middletown, PA 17057,

More information

Discovering the Association Rules in OLAP Data Cube with Daily Downloads of Folklore Materials *

Discovering the Association Rules in OLAP Data Cube with Daily Downloads of Folklore Materials * Discovering the Association Rules in OLAP Data Cube with Daily Downloads of Folklore Materials * Galina Bogdanova, Tsvetanka Georgieva Abstract: Association rules mining is one kind of data mining techniques

More information

Pseudo-Relevance Feedback and Title Re-Ranking for Chinese Information Retrieval

Pseudo-Relevance Feedback and Title Re-Ranking for Chinese Information Retrieval Pseudo-Relevance Feedback and Title Re-Ranking Chinese Inmation Retrieval Robert W.P. Luk Department of Computing The Hong Kong Polytechnic University Email: csrluk@comp.polyu.edu.hk K.F. Wong Dept. Systems

More information

Mining High Order Decision Rules

Mining High Order Decision Rules Mining High Order Decision Rules Y.Y. Yao Department of Computer Science, University of Regina Regina, Saskatchewan, Canada S4S 0A2 e-mail: yyao@cs.uregina.ca Abstract. We introduce the notion of high

More information

An Algorithm for Frequent Pattern Mining Based On Apriori

An Algorithm for Frequent Pattern Mining Based On Apriori An Algorithm for Frequent Pattern Mining Based On Goswami D.N.*, Chaturvedi Anshu. ** Raghuvanshi C.S.*** *SOS In Computer Science Jiwaji University Gwalior ** Computer Application Department MITS Gwalior

More information

X. A Relevance Feedback System Based on Document Transformations. S. R. Friedman, J. A. Maceyak, and S. F. Weiss

X. A Relevance Feedback System Based on Document Transformations. S. R. Friedman, J. A. Maceyak, and S. F. Weiss X-l X. A Relevance Feedback System Based on Document Transformations S. R. Friedman, J. A. Maceyak, and S. F. Weiss Abstract An information retrieval system using relevance feedback to modify the document

More information

Mining Generalized Sequential Patterns using Genetic Programming

Mining Generalized Sequential Patterns using Genetic Programming Mining Generalized Sequential Patterns using Genetic Programming Sandra de Amo Universidade Federal de Uberlândia Faculdade de Computação Uberlândia MG - Brazil deamo@ufu.br Ary dos Santos Rocha Jr. Universidade

More information

Purna Prasad Mutyala et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 2 (5), 2011,

Purna Prasad Mutyala et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 2 (5), 2011, Weighted Association Rule Mining Without Pre-assigned Weights PURNA PRASAD MUTYALA, KUMAR VASANTHA Department of CSE, Avanthi Institute of Engg & Tech, Tamaram, Visakhapatnam, A.P., India. Abstract Association

More information

EVALUATING GENERALIZED ASSOCIATION RULES THROUGH OBJECTIVE MEASURES

EVALUATING GENERALIZED ASSOCIATION RULES THROUGH OBJECTIVE MEASURES EVALUATING GENERALIZED ASSOCIATION RULES THROUGH OBJECTIVE MEASURES Veronica Oliveira de Carvalho Professor of Centro Universitário de Araraquara Araraquara, São Paulo, Brazil Student of São Paulo University

More information

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most common word appears 10 times then A = rn*n/n = 1*10/100

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 3 Retrieval Evaluation Retrieval Performance Evaluation Reference Collections CFC: The Cystic Fibrosis Collection Retrieval Evaluation, Modern Information Retrieval,

More information

vector space retrieval many slides courtesy James Amherst

vector space retrieval many slides courtesy James Amherst vector space retrieval many slides courtesy James Allan@umass Amherst 1 what is a retrieval model? Model is an idealization or abstraction of an actual process Mathematical models are used to study the

More information

An Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets

An Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.8, August 2008 121 An Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets

More information

The Effect of Word Sampling on Document Clustering

The Effect of Word Sampling on Document Clustering The Effect of Word Sampling on Document Clustering OMAR H. KARAM AHMED M. HAMAD SHERIN M. MOUSSA Department of Information Systems Faculty of Computer and Information Sciences University of Ain Shams,

More information

A Patent Retrieval Method Using a Hierarchy of Clusters at TUT

A Patent Retrieval Method Using a Hierarchy of Clusters at TUT A Patent Retrieval Method Using a Hierarchy of Clusters at TUT Hironori Doi Yohei Seki Masaki Aono Toyohashi University of Technology 1-1 Hibarigaoka, Tenpaku-cho, Toyohashi-shi, Aichi 441-8580, Japan

More information

Association Rule Mining. Introduction 46. Study core 46

Association Rule Mining. Introduction 46. Study core 46 Learning Unit 7 Association Rule Mining Introduction 46 Study core 46 1 Association Rule Mining: Motivation and Main Concepts 46 2 Apriori Algorithm 47 3 FP-Growth Algorithm 47 4 Assignment Bundle: Frequent

More information

Using Statistical Properties of Text to Create. Metadata. Computer Science and Electrical Engineering Department

Using Statistical Properties of Text to Create. Metadata. Computer Science and Electrical Engineering Department Using Statistical Properties of Text to Create Metadata Grace Crowder crowder@cs.umbc.edu Charles Nicholas nicholas@cs.umbc.edu Computer Science and Electrical Engineering Department University of Maryland

More information

Mining of Web Server Logs using Extended Apriori Algorithm

Mining of Web Server Logs using Extended Apriori Algorithm International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) International Journal of Emerging Technologies in Computational

More information

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY IJCSNS International Journal of Computer Science and Network Security, VOL.9 No.4, April 2009 349 WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY Mohammed M. Sakre Mohammed M. Kouta Ali M. N. Allam Al Shorouk

More information

Handling Missing Values via Decomposition of the Conditioned Set

Handling Missing Values via Decomposition of the Conditioned Set Handling Missing Values via Decomposition of the Conditioned Set Mei-Ling Shyu, Indika Priyantha Kuruppu-Appuhamilage Department of Electrical and Computer Engineering, University of Miami Coral Gables,

More information

DIVERSITY-BASED INTERESTINGNESS MEASURES FOR ASSOCIATION RULE MINING

DIVERSITY-BASED INTERESTINGNESS MEASURES FOR ASSOCIATION RULE MINING DIVERSITY-BASED INTERESTINGNESS MEASURES FOR ASSOCIATION RULE MINING Huebner, Richard A. Norwich University rhuebner@norwich.edu ABSTRACT Association rule interestingness measures are used to help select

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search Introduction to IR models and methods Rada Mihalcea (Some of the slides in this slide set come from IR courses taught at UT Austin and Stanford) Information Retrieval

More information

Reactive Ranking for Cooperative Databases

Reactive Ranking for Cooperative Databases Reactive Ranking for Cooperative Databases Berthier A. Ribeiro-Neto Guilherme T. Assis Computer Science Department Federal University of Minas Gerais Brazil berthiertavares @dcc.ufmg.br Abstract A cooperative

More information

A Novel Texture Classification Procedure by using Association Rules

A Novel Texture Classification Procedure by using Association Rules ITB J. ICT Vol. 2, No. 2, 2008, 03-4 03 A Novel Texture Classification Procedure by using Association Rules L. Jaba Sheela & V.Shanthi 2 Panimalar Engineering College, Chennai. 2 St.Joseph s Engineering

More information

An Apriori-like algorithm for Extracting Fuzzy Association Rules between Keyphrases in Text Documents

An Apriori-like algorithm for Extracting Fuzzy Association Rules between Keyphrases in Text Documents An Apriori-lie algorithm for Extracting Fuzzy Association Rules between Keyphrases in Text Documents Guy Danon Department of Information Systems Engineering Ben-Gurion University of the Negev Beer-Sheva

More information

A Comparative Study of Association Rules Mining Algorithms

A Comparative Study of Association Rules Mining Algorithms A Comparative Study of Association Rules Mining Algorithms Cornelia Győrödi *, Robert Győrödi *, prof. dr. ing. Stefan Holban ** * Department of Computer Science, University of Oradea, Str. Armatei Romane

More information

CS 6320 Natural Language Processing

CS 6320 Natural Language Processing CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic

More information

Tadeusz Morzy, Maciej Zakrzewicz

Tadeusz Morzy, Maciej Zakrzewicz From: KDD-98 Proceedings. Copyright 998, AAAI (www.aaai.org). All rights reserved. Group Bitmap Index: A Structure for Association Rules Retrieval Tadeusz Morzy, Maciej Zakrzewicz Institute of Computing

More information

Using Association Rules for Better Treatment of Missing Values

Using Association Rules for Better Treatment of Missing Values Using Association Rules for Better Treatment of Missing Values SHARIQ BASHIR, SAAD RAZZAQ, UMER MAQBOOL, SONYA TAHIR, A. RAUF BAIG Department of Computer Science (Machine Intelligence Group) National University

More information

A NEW ASSOCIATION RULE MINING BASED ON FREQUENT ITEM SET

A NEW ASSOCIATION RULE MINING BASED ON FREQUENT ITEM SET A NEW ASSOCIATION RULE MINING BASED ON FREQUENT ITEM SET Ms. Sanober Shaikh 1 Ms. Madhuri Rao 2 and Dr. S. S. Mantha 3 1 Department of Information Technology, TSEC, Bandra (w), Mumbai s.sanober1@gmail.com

More information

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science Information Retrieval CS 6900 Lecture 06 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Boolean Retrieval vs. Ranked Retrieval Many users (professionals) prefer

More information

A New Technique to Optimize User s Browsing Session using Data Mining

A New Technique to Optimize User s Browsing Session using Data Mining Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,

More information

Optimization using Ant Colony Algorithm

Optimization using Ant Colony Algorithm Optimization using Ant Colony Algorithm Er. Priya Batta 1, Er. Geetika Sharmai 2, Er. Deepshikha 3 1Faculty, Department of Computer Science, Chandigarh University,Gharaun,Mohali,Punjab 2Faculty, Department

More information

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University of Economics

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Ricardo Baeza-Yates Berthier Ribeiro-Neto ACM Press NewYork Harlow, England London New York Boston. San Francisco. Toronto. Sydney Singapore Hong Kong Tokyo Seoul Taipei. New

More information

Similarity search in multimedia databases

Similarity search in multimedia databases Similarity search in multimedia databases Performance evaluation for similarity calculations in multimedia databases JO TRYTI AND JOHAN CARLSSON Bachelor s Thesis at CSC Supervisor: Michael Minock Examiner:

More information

A Search Relevancy Tuning Method Using Expert Results Content Evaluation

A Search Relevancy Tuning Method Using Expert Results Content Evaluation A Search Relevancy Tuning Method Using Expert Results Content Evaluation Boris Mark Tylevich Chair of System Integration and Management Moscow Institute of Physics and Technology Moscow, Russia email:boris@tylevich.ru

More information

Retrieval Evaluation

Retrieval Evaluation Retrieval Evaluation - Reference Collections Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, Chapter

More information

Associating Terms with Text Categories

Associating Terms with Text Categories Associating Terms with Text Categories Osmar R. Zaïane Department of Computing Science University of Alberta Edmonton, AB, Canada zaiane@cs.ualberta.ca Maria-Luiza Antonie Department of Computing Science

More information

Performance Measures for Multi-Graded Relevance

Performance Measures for Multi-Graded Relevance Performance Measures for Multi-Graded Relevance Christian Scheel, Andreas Lommatzsch, and Sahin Albayrak Technische Universität Berlin, DAI-Labor, Germany {christian.scheel,andreas.lommatzsch,sahin.albayrak}@dai-labor.de

More information

Temporal Weighted Association Rule Mining for Classification

Temporal Weighted Association Rule Mining for Classification Temporal Weighted Association Rule Mining for Classification Purushottam Sharma and Kanak Saxena Abstract There are so many important techniques towards finding the association rules. But, when we consider

More information

Data Mining: Mining Association Rules. Definitions. .. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar..

Data Mining: Mining Association Rules. Definitions. .. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. .. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. Data Mining: Mining Association Rules Definitions Market Baskets. Consider a set I = {i 1,...,i m }. We call the elements of I, items.

More information

Modeling the Real World for Data Mining: Granular Computing Approach

Modeling the Real World for Data Mining: Granular Computing Approach Modeling the Real World for Data Mining: Granular Computing Approach T. Y. Lin Department of Mathematics and Computer Science San Jose State University San Jose California 95192-0103 and Berkeley Initiative

More information

Performance Based Study of Association Rule Algorithms On Voter DB

Performance Based Study of Association Rule Algorithms On Voter DB Performance Based Study of Association Rule Algorithms On Voter DB K.Padmavathi 1, R.Aruna Kirithika 2 1 Department of BCA, St.Joseph s College, Thiruvalluvar University, Cuddalore, Tamil Nadu, India,

More information

ARCHITECTURE AND IMPLEMENTATION OF A NEW USER INTERFACE FOR INTERNET SEARCH ENGINES

ARCHITECTURE AND IMPLEMENTATION OF A NEW USER INTERFACE FOR INTERNET SEARCH ENGINES ARCHITECTURE AND IMPLEMENTATION OF A NEW USER INTERFACE FOR INTERNET SEARCH ENGINES Fidel Cacheda, Alberto Pan, Lucía Ardao, Angel Viña Department of Tecnoloxías da Información e as Comunicacións, Facultad

More information

A recommendation engine by using association rules

A recommendation engine by using association rules Available online at www.sciencedirect.com Procedia - Social and Behavioral Sciences 62 ( 2012 ) 452 456 WCBEM 2012 A recommendation engine by using association rules Ozgur Cakir a 1, Murat Efe Aras b a

More information

Concept-Based Interactive Query Expansion

Concept-Based Interactive Query Expansion Concept-Based Interactive Query Expansion Bruno M. Fonseca 12 maciel@dcc.ufmg.br Paulo Golgher 2 golgher@akwan.com.br Bruno Pôssas 12 bavep@akwan.com.br Berthier Ribeiro-Neto 1 2 berthier@dcc.ufmg.br Nivio

More information

Outline. Possible solutions. The basic problem. How? How? Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity

Outline. Possible solutions. The basic problem. How? How? Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity Outline Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity Lecture 10 CS 410/510 Information Retrieval on the Internet Query reformulation Sources of relevance for feedback Using

More information

Inverted List Caching for Topical Index Shards

Inverted List Caching for Topical Index Shards Inverted List Caching for Topical Index Shards Zhuyun Dai and Jamie Callan Language Technologies Institute, Carnegie Mellon University {zhuyund, callan}@cs.cmu.edu Abstract. Selective search is a distributed

More information

CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING

CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING 41 CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING 3.1 INTRODUCTION This chapter describes the clustering process based on association rule mining. As discussed in the introduction, clustering algorithms have

More information

Information Retrieval. Information Retrieval and Web Search

Information Retrieval. Information Retrieval and Web Search Information Retrieval and Web Search Introduction to IR models and methods Information Retrieval The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent

More information

Relevance Feedback and Query Reformulation. Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price. Outline

Relevance Feedback and Query Reformulation. Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price. Outline Relevance Feedback and Query Reformulation Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price IR on the Internet, Spring 2010 1 Outline Query reformulation Sources of relevance

More information

Document Expansion for Text-based Image Retrieval at CLEF 2009

Document Expansion for Text-based Image Retrieval at CLEF 2009 Document Expansion for Text-based Image Retrieval at CLEF 2009 Jinming Min, Peter Wilkins, Johannes Leveling, and Gareth Jones Centre for Next Generation Localisation School of Computing, Dublin City University

More information

Using Coherence-based Measures to Predict Query Difficulty

Using Coherence-based Measures to Predict Query Difficulty Using Coherence-based Measures to Predict Query Difficulty Jiyin He, Martha Larson, and Maarten de Rijke ISLA, University of Amsterdam {jiyinhe,larson,mdr}@science.uva.nl Abstract. We investigate the potential

More information

A Technical Analysis of Market Basket by using Association Rule Mining and Apriori Algorithm

A Technical Analysis of Market Basket by using Association Rule Mining and Apriori Algorithm A Technical Analysis of Market Basket by using Association Rule Mining and Apriori Algorithm S.Pradeepkumar*, Mrs.C.Grace Padma** M.Phil Research Scholar, Department of Computer Science, RVS College of

More information

Encoding Words into String Vectors for Word Categorization

Encoding Words into String Vectors for Word Categorization Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

More information
