A New Approach for Automatic Thesaurus Construction and Query Expansion for Document Retrieval


Information and Management Sciences, Volume 18, Number 4, 2007

Liang-Yu Chen, National Taiwan University of Science and Technology, R.O.C.
Shyi-Ming Chen, National Taiwan University of Science and Technology, R.O.C.

Abstract

In this paper, we present a new approach for automatic thesaurus construction and query expansion for document retrieval. We analyze the information between any two terms in each document cluster center of the final document clusters or intermediate document clusters in the clustering process to automatically construct the thesaurus. This information includes the co-occurrence frequency of any two terms in each document cluster center, the degree of effect of each term in each document cluster center, and the inner noise of each document cluster. We also present a query expansion method to expand the user's queries and a new method to calculate the degree of similarity between the user's query and documents. The proposed thesaurus construction method and the proposed query expansion method can improve the performance of information retrieval systems for document retrieval.

Keywords: Document Retrieval, Query Expansion, Thesaurus Construction, Query Terms, Vector Space Models, Document Clusters.

1. Introduction

A thesaurus is commonly used in information retrieval (IR) systems [1]. It is composed of a set of terms (phrases or words) plus a set of relationships between these terms. Such systems can expand users' queries based on the constructed thesaurus. There are two types of thesauri: manually constructed and automatically constructed. A manually constructed thesaurus is built by domain experts, who define the relationships between any two terms. The major problem of manual thesauri is that they are expensive to build and hard to update.
Furthermore, even the same expert may define different relationships between two terms at different times. By contrast, automatic thesaurus construction is more objective. Different methods have been proposed for constructing thesauri, e.g., the Similarity Thesauri [18] and the Phrasefinder [14].

[Information and Management Sciences, Vol. 18, No. 4, December, 2007. Received September 2005; Revised January 2006; Accepted March. Supported in part by the National Science Council, Republic of China, under Grant NSC E]

Existing techniques for automatic query expansion can be categorized as either global or local. The local query expansion technique uses a small number of the top-ranked documents retrieved for a query to expand the query [3], [8]. However, if only a few of the top-ranked documents retrieved by the original user's query are relevant, the retrieval performance will seriously decrease. In recent years, query expansion methods based on the user's relevance feedback have been proposed [4], [13]; they analyze the relevant documents filtered by the user to expand the query and improve the retrieval performance. The global query expansion technique requires statistics that take a considerable amount of computing resources to obtain, such as the co-occurrence data about all possible pairs of terms in a corpus. One of the earliest global techniques is the term-clustering technique [15], which groups words into clusters based on their co-occurrences and uses the clusters for query expansion. In [2], Billhardt et al. presented a context vector model for information retrieval. In [10], He et al. presented a mining process to extract document cluster knowledge from the Web Citation Database to support the retrieval of Web publications. In [16], Kalczynski and Chou presented a temporal document retrieval model for business news archives, where the classical vector space model is extended to a temporal document retrieval model that incorporates fuzzy representations of temporal expressions. In this paper, we present a new approach for automatic thesaurus construction and query expansion for document retrieval.
We analyze the information between any two terms in each document cluster center of the final document clusters or intermediate document clusters in the clustering process to automatically construct the thesaurus. This information includes the co-occurrence frequency of any two terms in each document cluster center, the degree of effect of each term in each document cluster center, and the inner noise of each document cluster. We also present a query expansion method to expand the user's queries and a new method to calculate the degree of similarity between the user's query and documents. The proposed thesaurus construction method and the proposed query expansion method can improve the performance of information retrieval systems for document retrieval.

The rest of this paper is organized as follows. In Section 2, we briefly review the vector space model [19] in information retrieval systems and the document clustering method we presented in [5]. In Section 3, we present a new method for automatic thesaurus construction based on document clusters. In Section 4, we present a new query expansion method for document retrieval based on the constructed thesaurus. In Section 5, we analyze the experimental results of the proposed method. The conclusions are discussed in Section 6.

2. Preliminaries

In the vector space model [19], a document $d_k$ can be represented as a document vector $d_k = \langle w_{1k}, w_{2k}, \ldots, w_{nk} \rangle$, where $n$ denotes the number of terms appearing in document $d_k$ and $w_{ik}$ denotes the weight of term $t_i$ in document $d_k$. In [19], Salton used formula (1) to calculate the weight $w_{ik}$ of term $t_i$ in document $d_k$. Formula (2) is the inverse document frequency [19]:

$$w_{ik} = \frac{tf_{ik}}{\mathrm{Max}_j \, tf_{jk}} \times IDF_i, \qquad (1)$$

$$IDF_i = \log_{10} \frac{N}{n_i}, \qquad (2)$$

where $IDF_i$ denotes the inverse document frequency of term $t_i$, $N$ denotes the number of documents in the database, $n_i$ denotes the number of documents which contain term $t_i$, and $tf_{ik}$ denotes the frequency of term $t_i$ appearing in document $d_k$.

In [5], we presented a fuzzy hierarchical clustering method based on dynamic document cluster centers to cluster documents. We used the terms in documents to construct a document cluster center. The number of terms in a document cluster center changes when document clusters are merged, and the terms in a document cluster center affect the degree of similarity between two document clusters. In the following, we briefly describe some characteristics of a dynamic document cluster center [5]:

(1) During the generating or merging process of a document cluster, every term in the document cluster center is associated with a value of relative time to live (RTL), where the RTL value is the main factor that determines whether a term can stay in the document cluster center.
(2) During the generating or merging process of a document cluster, every term in the document cluster center is associated with a value of degree of effect. The higher the value of the degree of effect, the more significant the term is with respect to the document cluster.

(3) The number of terms in a document cluster center will increase or decrease dynamically depending on whether single document clusters or multiple document clusters are merged in the merging process.

In [5], we used formula (3) to calculate the degree of similarity $sim(C_i, C_j)$ between two document clusters $C_i$ and $C_j$:

$$sim(C_i, C_j) = \frac{\sum_{k=1}^{s} \mathrm{Min}(v_{ik}, v_{jk}) \, T(w_{ik}, w_{jk})}{\sum_{k=1}^{s} \mathrm{Max}(v_{ik}, v_{jk})} \times \frac{2S}{A + B}, \qquad (3)$$

where

$$T(w_{ik}, w_{jk}) = 1 - |w_{ik} - w_{jk}|, \quad A = \sum_{k=1}^{s} v_{ik}, \quad B = \sum_{k=1}^{s} v_{jk}, \quad S = \sum_{k=1}^{s} \mathrm{Min}(v_{ik}, v_{jk}),$$

$w_{ik}$ denotes the weight of term $t_k$ in cluster $C_i$, $w_{jk}$ denotes the weight of term $t_k$ in cluster $C_j$, $v_{ik}$ denotes the value of the degree of effect of term $t_k$ in cluster $C_i$, $v_{jk}$ denotes the value of the degree of effect of term $t_k$ in cluster $C_j$, and $s$ denotes the number of identical terms. Formula (3) thus considers both the weight and the degree of effect of a term when calculating the degree of similarity between two document clusters.

3. An Automatic Thesaurus Construction Algorithm

In this section, we present an automatic thesaurus construction algorithm. The thesaurus is constructed as a network structure based on the document clustering techniques we presented in [5]. In the constructed thesaurus network, every term is represented as a node, and the relationship between any two terms is represented by a link associated with a degree of relationship between these two terms. A higher degree of relationship between two terms indicates a stronger relationship between them. There are some automatic global thesaurus construction methods based on document clusters.
Some of them only consider the intermediate clusters (i.e., the document clusters formed in the middle of the clustering process) [9]; others only analyze the final clusters (i.e., the document clusters formed in the final stage of the clustering process) [16].

In the following, we describe the characteristics of a document cluster in the clustering process. In general, the intermediate clusters contain only a few documents, and the similarity between any two of these documents is usually high. In other words, the documents in intermediate clusters are usually closely related. On the other hand, the final clusters usually have more documents than the intermediate clusters, and the degree of similarity between two documents is usually low. However, information may be lost whether the system considers only the intermediate clusters or only the final clusters. For example, it is hard to assign all documents belonging to the same category to one cluster. In general, when the number of documents in a document cluster increases, the inner noise (i.e., irrelevant documents) in the document cluster increases at the same time. Therefore, a method that only considers final clusters usually cannot extract the most closely related terms and cannot precisely establish the degree of relationship between the related terms. On the other hand, a method that only considers intermediate clusters usually cannot extract related and important terms, because intermediate clusters do not contain enough documents.

In this paper, we present an automatic thesaurus construction algorithm based on all of the document clusters except the initial clusters. In order to calculate the degree of relationship between any two terms more precisely, we also present a method to calculate it based on the co-occurrence frequency of the two terms and the degree of effect of a term in a document cluster.
The proposed automatic thesaurus construction algorithm consists of the following two parts: (1) link generation and (2) calculating the degree of relationship between any two terms.

(1) Link Generation: We define a parameter $\gamma$, called the ratio of co-occurrence frequency in a document cluster, to determine whether there is a link between terms $t_x$ and $t_y$ in a document cluster center $C_k$, shown as follows:

$$\gamma = \frac{Num_d}{Num_c}, \qquad (4)$$

where $\gamma \in [0, 1]$, $Num_d$ denotes the number of documents in document cluster $C_k$ in which both term $t_x$ and term $t_y$ appear, and $Num_c$ denotes the number of documents in document cluster $C_k$. If the ratio $\gamma$ for terms $t_x$ and $t_y$ in document cluster $C_k$ is larger than the user-supplied parameter $\alpha$, called the threshold value of the ratio of co-occurrence frequency in a document cluster, then the system generates a link between terms $t_x$ and $t_y$. Otherwise, the system does not generate a link between the two terms.
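The link generation test of formula (4) can be illustrated with a minimal Python sketch. It assumes that each document is represented as a set of term strings and that the cluster center is given as a set of terms; the function names `cooccurrence_ratio` and `generate_links`, the toy cluster, and the value of alpha are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def cooccurrence_ratio(cluster_docs, term_x, term_y):
    """Formula (4): gamma = Num_d / Num_c for one document cluster.

    cluster_docs is a list of sets of terms (an assumed representation).
    """
    num_c = len(cluster_docs)
    if num_c == 0:
        return 0.0
    # Num_d: documents of the cluster that contain both terms.
    num_d = sum(1 for doc in cluster_docs if term_x in doc and term_y in doc)
    return num_d / num_c


def generate_links(cluster_docs, center_terms, alpha):
    """Return {(t_x, t_y): gamma} for every term pair whose gamma exceeds alpha."""
    links = {}
    for t_x, t_y in combinations(sorted(center_terms), 2):
        gamma = cooccurrence_ratio(cluster_docs, t_x, t_y)
        # The prose above generates a link when gamma is larger than alpha.
        if gamma > alpha:
            links[(t_x, t_y)] = gamma
    return links


# Toy cluster of four documents (illustrative data only).
cluster = [
    {"fuzzy", "cluster", "query"},
    {"fuzzy", "cluster"},
    {"fuzzy", "query"},
    {"cluster", "index"},
]
links = generate_links(cluster, {"fuzzy", "cluster", "query"}, alpha=0.4)
print(links)
```

In this toy cluster, the pairs (cluster, fuzzy) and (fuzzy, query) each co-occur in two of the four documents, so both receive a link, while (cluster, query) co-occurs in only one document and is skipped.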

(2) Calculating the Degree of Relationship between any Two Terms: Because we consider both the intermediate clusters and the final clusters in the clustering process when constructing the thesaurus, once the link between two terms has been generated, we need a method to calculate the degree of relationship between the two terms in the document cluster center precisely. The number of documents in a document cluster and the number of terms in a document cluster center can both affect the value of the degree of relationship between any two terms. In the following, we consider three factors that affect the calculation of the degree of relationship between any two terms:

(i) Inner Noise in a Document Cluster (Irrelevant Documents in a Document Cluster): Once term $t_x$ and term $t_y$ both appear in document cluster center $C_k$, we apply $c_k$, called the average of the distances of instances in the same category [7], as shown in formula (5), and define $c_{kp}$, called the average of the distances of partial instances in the same category, as shown in formula (6), to measure the inner noise with respect to these two terms in this document cluster:

$$c_k = \begin{cases} \dfrac{\sum_{l=1}^{Num_c - 1} \sum_{m=l+1}^{Num_c} \mu_k(d_l) \, \mu_k(d_m) \, sim(d_l, d_m)}{\sum_{l=1}^{Num_c - 1} \mu_k(d_l) \sum_{m=l+1}^{Num_c} \mu_k(d_m)}, & \text{if } Num_c > 1, \\ T_v, & \text{otherwise,} \end{cases} \qquad (5)$$

$$c_{kp} = \begin{cases} \dfrac{\sum_{l=1}^{Num_d - 1} \sum_{m=l+1}^{Num_d} \mu_k(d_l) \, \mu_k(d_m) \, sim(d_l, d_m)}{\sum_{l=1}^{Num_d - 1} \mu_k(d_l) \sum_{m=l+1}^{Num_d} \mu_k(d_m)}, & \text{if } Num_d > 1, \\ T_v, & \text{otherwise,} \end{cases} \qquad (6)$$

where $T_v$ is a parameter representing the variance for a single-instance category [7]; $Num_c$ denotes the number of documents in document cluster $C_k$; $Num_d$ denotes the number of documents which contain both terms $t_x$ and $t_y$ in document cluster $C_k$; in formula (5), $d_l$ and $d_m$ are any two documents in document cluster $C_k$, whereas in formula (6), $d_l$ and $d_m$ are documents which both contain terms $t_x$ and $t_y$; $\mu_k(d_l)$ and $\mu_k(d_m)$ denote the degrees of membership of documents $d_l$ and $d_m$, respectively, with respect to document cluster $C_k$; and $sim(d_l, d_m)$ is the similarity measure shown in formula (3), used here to calculate the degree of similarity between documents $d_l$ and $d_m$.

By observing the ratio between $c_{kp}$ and $c_k$, i.e., $c_{kp} / c_k$, we can tell whether there is inner noise in a document cluster with respect to terms $t_x$ and $t_y$. For example, if the ratio $c_{kp} / c_k$ is larger than 1, then there is inner noise in this document cluster with respect to terms $t_x$ and $t_y$. Furthermore, we also consider the fact that the number of documents in a document cluster may affect the degree of relationship between any two terms. Therefore, we use the ratio between $Num_d$ and $Num_c$ to adjust the degree of inner noise and derive formula (7):

$$\frac{c_{kp}}{c_k} \times \frac{Num_d}{Num_c}, \qquad (7)$$

where $Num_d$ denotes the number of documents which contain both terms $t_x$ and $t_y$ in document cluster $C_k$, and $Num_c$ denotes the number of documents in document cluster $C_k$.

(ii) The Degree of Effect between any Two Terms in a Document Cluster Center: In the clustering process, a document cluster will have different terms in its cluster center when different document clusters are merged, and the degree of effect between any two terms in the document cluster center can also affect the calculation of the degree of relationship between them. If a term has a larger value of the degree of effect in a document cluster center, then it is more significant in this document cluster. However, the number of documents in a document cluster affects the distribution of the values of the degrees of effect of the terms in the document cluster center. In general, the more documents there are in a document cluster, the larger the difference between the values of the degree of effect of the terms. In order to analyze this difference, the degrees of effect of the terms can be regarded as a real-value sequence.
Then, we consider the entropy of the real-value sequence [1] and the values of the degrees of effect $v_x$ and $v_y$ of terms $t_x$ and $t_y$, respectively, to calculate a partial degree of relationship between terms $t_x$ and $t_y$, shown as follows:

$$\frac{\mathrm{Min}(v_x, v_y)}{\mathrm{Max}(v_x, v_y)} \times \frac{E}{T}, \qquad (8)$$

where

$$E = \sum_{i=1}^{n} \frac{v_i}{\sum_{k=1}^{n} v_k} \log_2 \frac{1}{v_i / \sum_{k=1}^{n} v_k},$$

$v_i$ denotes the degree of effect of term $t_i$ in the document cluster center, $v_x$ and $v_y$ are the degrees of effect of terms $t_x$ and $t_y$ in the document cluster center, respectively, and $T$ denotes the number of terms in the document cluster center.

(iii) Co-Occurrence Frequency Analysis: Co-occurrence frequency analysis is commonly used for constructing a global thesaurus. In this paper, we also consider the co-occurrence of any two terms $t_x$ and $t_y$ of document cluster $C$ to calculate the degree of relationship between terms $t_x$ and $t_y$. After the documents have been clustered, a higher value of co-occurrence with respect to terms $t_x$ and $t_y$ means that these two terms are more closely related to each other. We use formula (9) to measure the co-occurrence frequency information with respect to terms $t_x$ and $t_y$:

$$\frac{co\_occ}{\mathrm{Max}(occ_x, occ_y)}, \qquad (9)$$

where $occ_x$ denotes the number of documents containing term $t_x$ in document cluster $C$, $occ_y$ denotes the number of documents containing term $t_y$ in document cluster $C$, and $co\_occ$ denotes the number of documents containing both terms $t_x$ and $t_y$ in document cluster $C$.

Based on the discussions of (i), (ii) and (iii), once the ratio of co-occurrence frequency in a document cluster with respect to terms $t_x$ and $t_y$ in document cluster $C_k$ is larger than the threshold value $\alpha$, we use formula (10) to calculate the degree of relationship $\delta$ between terms $t_x$ and $t_y$:

$$\delta = \frac{c_{kp}}{c_k} \times \frac{Num_d}{Num_c} \times \frac{\mathrm{Min}(v_x, v_y)}{\mathrm{Max}(v_x, v_y)} \times \frac{E}{T} \times \frac{co\_occ}{\mathrm{Max}(occ_x, occ_y)}, \qquad (10)$$

where formula (10) is a combination of formulas (7), (8) and (9). The automatic thesaurus construction algorithm is now presented as follows.

Automatic Thesaurus Construction Algorithm Based on Document Clusters:

Input: The threshold value $\alpha$ of the ratio of co-occurrence frequency in a document cluster, where $\alpha \in [0, 1]$; the intermediate clusters and final clusters $C_1, C_2, \ldots, C_p$ of the documents.

Output: The constructed thesaurus.

Step 1: Initially, set the variables $i = 0$ and $k = 0$.

Step 2: Let $k = k + 1$. If $k > p$ then Stop. Otherwise, let document cluster $C_k$ be the training document cluster and let $i = 0$. Assume that there exist $n_k$ terms $t_1, t_2, \ldots, t_{n_k}$ in the document cluster center $C_k$.

Step 3: Let $i = i + 1$. If $i > n_k - 1$ then go to Step 2. Otherwise, let $j = i$. Choose term $t_i$ from the document cluster $C_k$ and perform Step 3.1 to Step 3.5.

Step 3.1: Let $j = j + 1$ and find term $t_j$.

Step 3.2: Based on formula (4), calculate the ratio of co-occurrence frequency $\gamma$ between term $t_i$ and term $t_j$ in the document cluster $C_k$, where $\gamma \in [0, 1]$. If $\gamma < \alpha$ then go to Step 3.3. Otherwise, go to Step 3.4.

Step 3.3: If $j < n_k$ then go to Step 3.1. Otherwise, go to Step 3.

Step 3.4: Check whether there is a link between terms $t_i$ and $t_j$. If there is a link between terms $t_i$ and $t_j$ then go to Step 3.5. Otherwise, generate a new link $L_{t_i, t_j}$ between terms $t_i$ and $t_j$, and calculate the degree of relationship $\delta$ between terms $t_i$ and $t_j$ based on formula (10), where $\delta \in [0, 1]$. If $j < n_k$ then go to Step 3.1. Otherwise, go to Step 3.

Step 3.5: Assume that the original degree of relationship between terms $t_i$ and $t_j$ is $\omega$, where $\omega \in [0, 1]$. Based on formula (10), calculate the degree of relationship $\delta$ between terms $t_i$ and $t_j$ in the document cluster $C_k$, where $\delta \in [0, 1]$. Then, let the new degree of relationship between terms $t_i$ and $t_j$ be equal to $\omega + \delta$. If $j < n_k$ then go to Step 3.1. Otherwise, go to Step 3.

4. Query Expansion

In general, users retrieve documents through information retrieval systems [11]. However, the query terms submitted by users usually do not provide enough information to retrieve most of the relevant documents. Since the query expansion method was proposed [15], it has been the main method for improving the performance of information retrieval systems. In this paper, we apply the constructed thesaurus to expand the user's query in order to improve the performance of information retrieval systems. When the user submits his/her query terms, the system chooses the term that has the highest IDF value among the query terms as the center of the query expansion terms, and chooses the terms having higher degrees of relationship with respect to this center as query expansion terms.
Then, the system calculates the degree of relationship and the weight of each expansion term. Finally, it replaces the original query terms by the expansion terms. In the following, we introduce some parameters and formulas that are needed for the proposed query expansion algorithm:

(i) The Number of Expansion Terms $\beta$: This is a user-defined parameter, and the system will generate $\beta$ expansion terms, where $\beta \geq 1$.

(ii) The Calculation of the Relevant Degree of Expansion Terms: After the system gets the center $t$ of the query expansion terms, the relevant degree of every term $t'$ found according to term $t$ is calculated based on formula (11):

$$\frac{\theta}{\rho} \times \frac{f'}{f}, \qquad (11)$$

where $\rho$ denotes the highest degree of relationship between the center $t$ of the query expansion terms and the other terms; $\theta$ denotes the degree of relationship between terms $t$ and $t'$; $f$ and $f'$ are the IDF values of terms $t$ and $t'$, respectively.

(iii) The Threshold Value $\varphi$ for Filtering Document Clusters: After finishing the query expansion process, the system filters the training document clusters (assume that the training clusters are $C_1, C_2, \ldots, C_o$) and then calculates the weight of each expanded term based on the documents of the document clusters that pass the filter. To pass the filter, a document cluster $C_a$ must satisfy formula (12):

$$\frac{v_{C_a}}{DC_{C_a}} \geq \varphi, \quad 1 \leq a \leq o, \qquad (12)$$

where $v_{C_a}$ denotes the degree of effect of term $t$ in document cluster $C_a$, $t$ is the center of the query expansion terms, and $DC_{C_a}$ denotes the number of documents in document cluster $C_a$.

(iv) The Calculation of the Weights of Expanded Terms: After the system has found the center $t$ of the expansion terms, the weight of every expanded term $t'$ is calculated based on formula (13):

$$\frac{\sum_{a=1}^{o} \sum_{s=1}^{p_{C_a}} w_{C_a, d_s}}{\sum_{a=1}^{o} \sum_{s=1}^{p_{C_a}} I}, \qquad (13)$$

where $p_{C_a}$ denotes the number of documents in document cluster $C_a$ and $w_{C_a, d_s}$ denotes the weight of term $t'$ in document $d_s$ of document cluster $C_a$. If term $t'$ does not appear in document $d_s$ of document cluster $C_a$, then the value of $w_{C_a, d_s}$ is set to 0. If term $t'$ appears in document $d_s$ of document cluster $C_a$, then the value of $I$ is set to 1. Otherwise, the value of $I$ is set to 0.

The query expansion algorithm is now presented as follows:

Query Expansion Algorithm:

Input: The number $\beta$ of expansion terms, the threshold value $\varphi$ for filtering document clusters, and the query-terms set $Q = \{q_1, q_2, \ldots, q_m\}$, all submitted by the user. The IDF vector of the query-terms set $Q$ is $IDF_Q = \langle f_1, f_2, \ldots, f_m \rangle$, where $f_i$ denotes the IDF value of query term $q_i$, $1 \leq i \leq m$, $m \leq \beta$, and $\varphi \in [0, 1]$. The intermediate clusters and final clusters in the clustering process are $C_1, C_2, \ldots, C_p$. Initial variable: $i = 0$.

Output: The query expansion term set $K = \{e_1, e_2, \ldots, e_\beta\}$; the relevant degree vector $G = \langle g_1, g_2, \ldots, g_\beta \rangle$ of term set $K$, where $g_i$ denotes the relevant degree of term $e_i$ and $1 \leq i \leq \beta$; and the weight vector $W = \langle w_1, w_2, \ldots, w_\beta \rangle$ of term set $K$, where $w_i$ denotes the weight of term $e_i$ and $1 \leq i \leq \beta$.

Step 1: Choose the query term $q_l$ that has the largest IDF value from the query term set $Q$ (assume that $f_l$ denotes the IDF value of term $q_l$) and let term $q_l$ be the center of the query expansion. Put $q_l$ into the query expansion term set $K$ and let $e_l = q_l$, where $1 \leq l \leq m$.

Step 2: Let the relevant degree $g_l$ of term $e_l$ be equal to 1.

Step 3: Find the term $t$ in the constructed thesaurus that has the largest degree of relationship with term $e_l$ (assume that the degree of relationship between terms $t$ and $e_l$ is $\rho$, where $\rho \in [0, 1]$).

Step 4: Put every query term $q_s$ in the query term set $Q$ except term $q_l$ into $K$ (assume that the link between query term $q_s$ and the center $e_l$ of the query expansion terms is $L_{e_l, q_s}$), such that $e_s = q_s$, where $1 \leq s \leq m$ and $s \neq l$. Calculate the relevant degree $g_s$ of every term $e_s$ in the query expansion term set $K$ based on formula (11) and mark the link $L_{e_l, e_s}$, where $1 \leq s \leq m$ and $s \neq l$. Set $i = m$. If $i = \beta$ then Stop. Otherwise, go to Step 5.

Step 5: Choose the term $t_r$ from the thesaurus which has the largest degree of relationship with term $e_l$ among the terms whose link to $e_l$ has not been marked yet. Mark the link $L_{e_l, t_r}$ between term $t_r$ and the center $e_l$ of the query expansion terms. Set $i = i + 1$ and let $e_i = t_r$. Calculate the relevant degree $g_i$ of term $e_i$ based on formula (11). If $i < \beta$ then go to Step 5. Otherwise, go to Step 6.
Step 6: From the training document clusters $C_1, C_2, \ldots, C_p$, find the document clusters whose cluster center contains term $e_l$, where the degree of effect of term $e_l$ is larger than that of the other terms in the cluster center and where the document clusters satisfy formula (12).

Step 7: Calculate the weight $w_j$ of each term $e_j$ in the query expansion term set $K$ based on formula (13), where $1 \leq j \leq m$.

After expanding the original user's query, we apply the proposed similarity calculation algorithm to improve the retrieval performance. The algorithm uses the degrees of relationship of the terms generated by the proposed query expansion algorithm as the default degrees of relationship when querying. The degree of relationship of every query term is then changed dynamically according to the degrees of relationship of the preceding terms. In the following, we present the method to calculate the degree of similarity between the query expansion term set and a document. Assume that the query expansion term set is $K = \{e_1, e_2, \ldots, e_\beta\}$, its degree of relationship vector is $G = \langle g_1, g_2, \ldots, g_\beta \rangle$, and its weighting vector is $W = \langle w_1, w_2, \ldots, w_\beta \rangle$. The degree of similarity between the query expansion term set $K$ and document $d_k$ is calculated as follows:

$$\frac{\sum_{j=1}^{\beta} g'_j \, T(w_j, w_{j, d_k})}{\sum_{j=1}^{\beta} g'_j}, \qquad (14)$$

where $w_j$ denotes the weight of term $e_j$ in $W$; $w_{j, d_k}$ denotes the weight of term $e_j$ in document $d_k$; $T(w_j, w_{j, d_k})$ [12] calculates the degree of similarity between $w_j$ and $w_{j, d_k}$; and $g'_j$ denotes the dynamic degree of relationship of term $e_j$. The formula used to calculate $g'_j$ is shown as follows:

$$g'_j = g_j + \sum_{k=1}^{j-1} \big( (T(w_k, w_{k, d_j}) - g_k) \times g_j \big), \qquad (15)$$

where if $g'_j > 1$, then let $g'_j = 1$; if $g'_j < 0$, then let $g'_j = 0$.

The proposed similarity calculation algorithm is now presented as follows:

The Similarity Calculation Algorithm:

Input: The query expansion term set $K = \{e_1, e_2, \ldots, e_\beta\}$, its degree of relationship vector $G = \langle g_1, g_2, \ldots, g_\beta \rangle$, and its weighting vector $W = \langle w_1, w_2, \ldots, w_\beta \rangle$.

Output: The degree of similarity between each document and the query expansion term set $K$.

Step 1: Sort the terms in the query expansion term set $K$ according to their degrees of relationship in descending order to form a sorted query expansion term set, re-indexed as $K = \{e_1, e_2, \ldots, e_\beta\}$ with degree of relationship vector $G = \langle g_1, g_2, \ldots, g_\beta \rangle$ and weighting vector $W = \langle w_1, w_2, \ldots, w_\beta \rangle$, such that $g_1 \geq g_2 \geq \cdots \geq g_\beta$.
Step 2: Based on formulas (14) and (15), calculate the degree of similarity between the query expansion term set and each document. Sort the documents according to their degrees of similarity in descending order for the user's browsing.
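The two steps of the similarity calculation algorithm can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: it assumes that $T(a, b) = 1 - |a - b|$ (the form given with formula (3)), that formula (14) is the $g'$-weighted average of the $T$ values, and that formula (15) adds $(T(w_k, \cdot) - g_k) \times g_j$ over the preceding terms before clamping to $[0, 1]$. The function names and the toy weights are hypothetical.

```python
def t_sim(a, b):
    """Assumed form of T: closeness of two weights in [0, 1]."""
    return 1.0 - abs(a - b)


def clamp01(x):
    """Clamp a value to [0, 1], as required after formula (15)."""
    return max(0.0, min(1.0, x))


def similarity(g, w, doc_weights):
    """Formulas (14) and (15) for one document.

    g: relevant degrees, already sorted in descending order (Step 1)
    w: weights of the expansion terms, in the same order
    doc_weights: weights of the same terms in the document
    """
    match = [t_sim(wq, wd) for wq, wd in zip(w, doc_weights)]
    dynamic = []
    for j, g_j in enumerate(g):
        # Assumed reading of formula (15): adjust g_j by how well the
        # preceding (more relevant) terms matched the document.
        adjustment = sum((match[k] - g[k]) * g_j for k in range(j))
        dynamic.append(clamp01(g_j + adjustment))
    denominator = sum(dynamic)
    if denominator == 0:
        return 0.0
    return sum(gd * m for gd, m in zip(dynamic, match)) / denominator


def rank_documents(g, w, docs):
    """Step 1 (sort terms by g) and Step 2 (score and sort documents)."""
    order = sorted(range(len(g)), key=lambda j: g[j], reverse=True)
    g_sorted = [g[j] for j in order]
    w_sorted = [w[j] for j in order]
    scored = []
    for name, weights in docs.items():
        d_sorted = [weights[j] for j in order]
        scored.append((name, similarity(g_sorted, w_sorted, d_sorted)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: two expansion terms and two documents (illustrative only).
docs = {"doc_a": [0.8, 0.0], "doc_b": [0.0, 0.4]}
ranked = rank_documents([1.0, 0.5], [0.8, 0.4], docs)
print(ranked)
```

Under these assumptions, a document that matches the most relevant term well (doc_a) keeps the second term's degree at its default, while a document that misses the most relevant term (doc_b) has the second term's degree reduced, so doc_a is ranked first.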

In summary, the flowchart in Figure 1 illustrates how the three proposed algorithms are applied for document retrieval.

Figure 1. The flowchart of the proposed method.

5. Experimental Results

We have implemented the proposed method on a Pentium 4 PC using Delphi Version 5.0. We chose 292 research reports consisting of 15 categories from [20] for constructing the thesaurus; they are a subset of the collection of the research reports of the National Science Council (NSC), Republic of China. The number of documents in each category is between 13 and 15. Each document consists of several parts, including a report ID, a title, a Chinese abstract, an English abstract, etc., and a document may belong to several categories at the same time. The system automatically extracts the report ID and the English abstract of each report, uses a stemming method [1] to extract the roots of terms, and generates the term database based on these term roots. We use the proposed method shown in Figure 1 to construct the thesaurus automatically based on the document clusters obtained by [5] and apply the constructed thesaurus to expand the original user's queries. In our experiment, we set the threshold value $\alpha = 0.5$ when executing the automatic thesaurus construction algorithm and set the value $\beta = 5$ when executing the query expansion algorithm. We test the performance of query expansion on the same database [20] by observing the improvement of the retrieval performance of the nine queries shown in Table 1. Figure 2 and Figure 3 compare the precision rate and the recall rate of the top 20 documents of the nine queries, respectively, where the precision rate and the recall rate are defined as follows [1]:

$$\text{Precision rate} = \frac{R_e}{R_a}, \qquad (16)$$

$$\text{Recall rate} = \frac{R_e}{R_r}, \qquad (17)$$

where $R_e$ denotes the number of relevant retrieved documents, $R_r$ denotes the number of relevant documents in the collection, and $R_a$ denotes the number of retrieved documents. From Figure 2 and Figure 3, we can see that the query expansion method proposed in this paper can expand the user's queries and obtains higher precision rates and recall rates.

Table 1. A list of the user's queries.
Q1: Heterogeneous Database
Q2: Natural Language Processing
Q3: Network Security
Q4: Multimedia Database
Q5: Parallel Computing
Q6: Speech Recognition
Q7: Expert System
Q8: Mobile Communication
Q9: Robot Arm

Figure 2. The precision rates of the top 20 retrieved documents of the user's queries. (Series: The Proposed Method; The Original User's Queries. Horizontal axis: Queries.)

Figure 3. The recall rates of the top 20 retrieved documents of the user's queries. (Series: The Proposed Method; The Original User's Queries. Horizontal axis: Queries.)
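As a small illustration of formulas (16) and (17), the following Python sketch computes the precision rate and the recall rate of the top-k documents of a ranked result list. The function name `precision_recall_at_k`, the document identifiers, and the relevance judgments are hypothetical; the experiment above uses k = 20.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k=20):
    """Precision (16) and recall (17) over the top-k retrieved documents.

    ranked_ids: document ids in descending order of similarity
    relevant_ids: set of ids judged relevant for the query
    """
    retrieved = ranked_ids[:k]
    r_a = len(retrieved)                                     # retrieved documents
    r_e = sum(1 for d in retrieved if d in relevant_ids)     # relevant retrieved
    r_r = len(relevant_ids)                                  # relevant in collection
    precision = r_e / r_a if r_a else 0.0
    recall = r_e / r_r if r_r else 0.0
    return precision, recall


# Toy example: four retrieved documents, three relevant ones in the collection.
p, r = precision_recall_at_k(["d1", "d2", "d3", "d4"], {"d1", "d3", "d9"}, k=4)
print(p, r)
```

Here two of the four retrieved documents are relevant, so the precision rate is 2/4, and two of the three relevant documents are retrieved, so the recall rate is 2/3.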

6. Conclusions

In this paper, we have presented a new approach for automatic thesaurus construction and query expansion for document retrieval. We analyze the information between any two terms in each document cluster center of the final document clusters or intermediate document clusters in the clustering process to automatically construct the thesaurus. This information includes the co-occurrence frequency of any two terms in each document cluster center, the degree of effect of each term in each document cluster center, and the inner noise of each document cluster. The thesaurus is automatically constructed as a network structure. We have also presented a query expansion algorithm to expand the user's queries and a new method to calculate the degree of similarity between the user's query and documents. The proposed thesaurus construction method and the proposed query expansion method can improve the performance of information retrieval systems for document retrieval.

References

[1] Baeza-Yates, R. and Ribeiro-Neto, B., Modern Information Retrieval, ACM Press, New York.
[2] Billhardt, H., Borrajo, D. and Maojo, V., A context vector model for information retrieval, Journal of the American Society for Information Science and Technology, Vol. 53, No. 3.
[3] Buckley, C., Salton, G., Allan, J. and Singhal, A., Automatic query expansion using SMART, Proceedings of the 3rd Text Retrieval Conference (TREC-3), edited by Donna K. Harman, National Institute of Standards and Technology, Gaithersburg, MD, pp. 69-80.
[4] Chang, Y. C., Chen, S. M. and Liau, C. J., A new query expansion method based on fuzzy rules, Proceedings of the 2003 Joint Conference on AI, Fuzzy System, and Grey System, Taipei, Taiwan, Republic of China.
[5] Chen, L. Y. and Chen, S. M., A new fuzzy hierarchical clustering method based on dynamic cluster centers, in Proceedings of the Ninth Conference on Information Management Research, Changhua, Taiwan, Republic of China.
[6] Chen, L. Y. and Chen, S. M., A new method for automatic thesaurus construction and query expansion, Proceedings of the th International Conference on Information Management, Taipei, Taiwan, Republic of China.
[7] Chen, C. L. P. and Lu, Y., FUZZ: A fuzzy-based concept formation system that integrates human categorization and numerical clustering, IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics, Vol. 27, No. 1, pp. 79-94.
[8] Croft, W. and Harper, D. J., Using probabilistic models of document retrieval without relevance information, Journal of Documentation, Vol. 35, No. 4.
[9] Crouch, C. J., An approach to the automatic construction of global thesauri, Information Processing & Management, Vol. 26.
[10] He, Y., Hui, S. C. and Fong, A. C. M., Mining a Web citation database for document clustering, Applied Artificial Intelligence, Vol. 16, No. 4.
[11] Horng, Y. J., Chen, S. M. and Lee, C. H., A new fuzzy information retrieval method based on document terms reweighting techniques, International Journal of Information and Management Sciences, Vol. 14, No. 4, pp. 63-82, 2003.

16 314 Information and Management Sciences, Vol. 18, No. 4, December, 2007 [12] Horng, Y. J., Chen, S. M. and Lee, C. H., Fuzzy information retrieval using fuzzy hierarchical clustering and fuzzy inference techniques, Proceedings of the 13th International Conference on Information Management, Taipei, Taiwan, Republic of China, pp , [13] Ide, E., New experiments in relevance feedback, in The SMART Retrieval System, edited by G. Salton, Prentice Hall, Englewood Cliffs, NJ, pp , [14] Jing, Y. and Croft, W. B., An association thesaurus for information retrieval, Proceedings of the 1994 Intelligent Multimedia Information Retrieval Systems, NY, pp , [15] Jones, K. S., Automatic Keyword Classification for Information Retrieval, Butterworths, London, UK, [16] Kalczynski, P. J. and Chou, A., Temporal document retrieval model for business news archives, Information Processing and Management, Vol.41, No.3, pp , [17] Ma, Z. M., Zhang, W. J. and W. Y. Ma, Extending object-oriented databases for fuzzy information modeling, Information Systems, Vol.29, No.5, pp , [18] Qiu, Y. and Frei, H. P., Concept based query expansion, Proceedings of the 16th Annual International ACM Conference on Research and Development in Information Retrieval, NY, pp , [19] Salton, G., The Smart Retrieval System - Experiments in Automatic Document Processing, Prentice Hall, Englewood Cliffs, NJ, [20] A Subset of the Collection of the Research Reports of the National Science Council, Taiwan, R. O. C., Report Data base/292documents.html (Data Source: Authors Information Liang-Yu Chen received the B.S. degree from the Department of Computer Science and Information Engineering, Tamkang University, Taipei, Taiwan, Republic of China, in June 2002, and received the M.S. 
degree in the Department of Computer Science and Information Engineering at National Taiwan University of Science and Technology, Taipei, Taiwan, in June His current research interests include information retrieval systems, fuzzy systems, and artificial intelligence.
Department of Electronic Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan, R.O.C. M @mail.ntust.edu.tw

Shyi-Ming Chen is currently a Professor in the Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan, R.O.C. He received the Ph.D. degree in Electrical Engineering from the National Taiwan University, Taipei, Taiwan, in June He has published more than 260 papers in refereed journals, book chapters and conference proceedings. His research interests include fuzzy systems, information retrieval systems, knowledge-based systems, neural networks, artificial intelligence, data mining, and genetic algorithms. He is currently the President of the Taiwanese Association for Artificial Intelligence (TAAI).
He is an Associate Editor of the IEEE Transactions on Systems, Man, and Cybernetics - Part C, an Associate Editor of the IEEE Computational Intelligence Magazine, an Associate Editor of the Journal of Intelligent & Fuzzy Systems, an Editor of the New Mathematics and Natural Computation Journal, an Associate Editor of the International Journal of Fuzzy Systems, an Editorial Board Member of the International Journal of Information and Communication Technology, an Editorial Board Member of the WSEAS Transactions on Systems, an Associate Editor of the WSEAS Transactions on Computers, an Editor of the Journal of Advanced Computational Intelligence and Intelligent Informatics, an Associate Editor of the International Journal of Applied Intelligence, an Associate Editor of the International Journal of Artificial Intelligence Tools, an Editorial Board Member of the International Journal of Computational Intelligence and Applications, an Editorial Board Member of the Advances in Fuzzy Sets and Systems Journal, an Editor of the International Journal of Soft Computing, an Editor of the Asian Journal of Information Technology, an Editorial Board Member of the International Journal of Intelligence Systems Technologies and Applications, an Editor of the Asian Journal of Information Management, an Associate Editor of the International Journal of Innovative Computing, Information and Control, an Editorial Board Member of the International Journal of Computer Applications in Technology, an Associate Editor of the Journal of Uncertain Systems, an Editorial Board Member of the Advances in Computer Sciences and Engineering Journal, and an Associate Editor of the International Journal of Intelligent Information and Database Systems. He was an Editor of the Journal of the Chinese Grey System Association from 1998 to He is currently also the Dean of the College of Electrical Engineering and Computer Science, Jinwen University of Science and Technology, Taipei, Taiwan, R.O.C. He is an IET Fellow (Fellow of IEE).
Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan, R.O.C. Tel:


More information

Research Article Path Planning Using a Hybrid Evolutionary Algorithm Based on Tree Structure Encoding

Research Article Path Planning Using a Hybrid Evolutionary Algorithm Based on Tree Structure Encoding e Scientific World Journal, Article ID 746260, 8 pages http://dx.doi.org/10.1155/2014/746260 Research Article Path Planning Using a Hybrid Evolutionary Algorithm Based on Tree Structure Encoding Ming-Yi

More information

Dependence among Terms in Vector Space Model

Dependence among Terms in Vector Space Model Dependence among Terms in Vector Space Model Ilmério Reis Silva, João Nunes de Souza, Karina Silveira Santos Faculdade de Computação - Universidade Federal de Uberlândia (UFU) e-mail: [ilmerio, Nunes]@facom.ufu.br,

More information

Large Scale Chinese News Categorization. Peng Wang. Joint work with H. Zhang, B. Xu, H.W. Hao

Large Scale Chinese News Categorization. Peng Wang. Joint work with H. Zhang, B. Xu, H.W. Hao Large Scale Chinese News Categorization --based on Improved Feature Selection Method Peng Wang Joint work with H. Zhang, B. Xu, H.W. Hao Computational-Brain Research Center Institute of Automation, Chinese

More information

Information Retrieval. Information Retrieval and Web Search

Information Retrieval. Information Retrieval and Web Search Information Retrieval and Web Search Introduction to IR models and methods Information Retrieval The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent

More information

Interest-based Recommendation in Digital Library

Interest-based Recommendation in Digital Library Journal of Computer Science (): 40-46, 2005 ISSN 549-3636 Science Publications, 2005 Interest-based Recommendation in Digital Library Yan Yang and Jian Zhong Li School of Computer Science and Technology,

More information

A DYNAMIC CONTROLLING SCHEME FOR A TRACKING SYSTEM. Received February 2009; accepted April 2009

A DYNAMIC CONTROLLING SCHEME FOR A TRACKING SYSTEM. Received February 2009; accepted April 2009 ICIC Express Letters ICIC International c 2009 ISSN 1881-803X Volume 3, Number 2, June 2009 pp. 219 223 A DYNAMIC CONTROLLING SCHEME FOR A TRACKING SYSTEM Ming-Liang Li 1,Yu-KueiChiu 1, Yi-Nung Chung 2

More information

Birkbeck (University of London)

Birkbeck (University of London) Birkbeck (University of London) MSc Examination for Internal Students Department of Computer Science and Information Systems Information Retrieval and Organisation (COIY64H7) Credit Value: 5 Date of Examination:

More information

An Objective Evaluation Methodology for Handwritten Image Document Binarization Techniques

An Objective Evaluation Methodology for Handwritten Image Document Binarization Techniques An Objective Evaluation Methodology for Handwritten Image Document Binarization Techniques K. Ntirogiannis, B. Gatos and I. Pratikakis Computational Intelligence Laboratory, Institute of Informatics and

More information

RMIT University at TREC 2006: Terabyte Track

RMIT University at TREC 2006: Terabyte Track RMIT University at TREC 2006: Terabyte Track Steven Garcia Falk Scholer Nicholas Lester Milad Shokouhi School of Computer Science and IT RMIT University, GPO Box 2476V Melbourne 3001, Australia 1 Introduction

More information

Cluster-based Similarity Aggregation for Ontology Matching

Cluster-based Similarity Aggregation for Ontology Matching Cluster-based Similarity Aggregation for Ontology Matching Quang-Vinh Tran 1, Ryutaro Ichise 2, and Bao-Quoc Ho 1 1 Faculty of Information Technology, Ho Chi Minh University of Science, Vietnam {tqvinh,hbquoc}@fit.hcmus.edu.vn

More information

Information Retrieval. hussein suleman uct cs

Information Retrieval. hussein suleman uct cs Information Management Information Retrieval hussein suleman uct cs 303 2004 Introduction Information retrieval is the process of locating the most relevant information to satisfy a specific information

More information

Kohei Arai 1 Graduate School of Science and Engineering Saga University Saga City, Japan

Kohei Arai 1 Graduate School of Science and Engineering Saga University Saga City, Japan Numerical Representation of Web Sites of Remote Sensing Satellite Data Providers and Its Application to Knowledge Based Information Retrievals with Natural Language Kohei Arai 1 Graduate School of Science

More information

Retrieval of Web Documents Using a Fuzzy Hierarchical Clustering

Retrieval of Web Documents Using a Fuzzy Hierarchical Clustering International Journal of Computer Applications (97 8887) Volume No., August 2 Retrieval of Documents Using a Fuzzy Hierarchical Clustering Deepti Gupta Lecturer School of Computer Science and Information

More information

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. Knowledge Retrieval Franz J. Kurfess Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. 1 Acknowledgements This lecture series has been sponsored by the European

More information

A Vector Space Equalization Scheme for a Concept-based Collaborative Information Retrieval System

A Vector Space Equalization Scheme for a Concept-based Collaborative Information Retrieval System A Vector Space Equalization Scheme for a Concept-based Collaborative Information Retrieval System Takashi Yukawa Nagaoka University of Technology 1603-1 Kamitomioka-cho, Nagaoka-shi Niigata, 940-2188 JAPAN

More information

TERM BASED SIMILARITY MEASURE FOR TEXT CLASSIFICATION AND CLUSTERING USING FUZZY C-MEANS ALGORITHM

TERM BASED SIMILARITY MEASURE FOR TEXT CLASSIFICATION AND CLUSTERING USING FUZZY C-MEANS ALGORITHM TERM BASED SIMILARITY MEASURE FOR TEXT CLASSIFICATION AND CLUSTERING USING FUZZY C-MEANS ALGORITHM D. Renukadevi, S. Sumathi Abstract The progress of information technology and increasing usability of

More information

Effective Information Retrieval using Genetic Algorithms based Matching Functions Adaptation

Effective Information Retrieval using Genetic Algorithms based Matching Functions Adaptation Effective Information Retrieval using Genetic Algorithms based Matching Functions Adaptation Praveen Pathak Michael Gordon Weiguo Fan Purdue University University of Michigan pathakp@mgmt.purdue.edu mdgordon@umich.edu

More information

TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback

TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback RMIT @ TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback Ameer Albahem ameer.albahem@rmit.edu.au Lawrence Cavedon lawrence.cavedon@rmit.edu.au Damiano

More information

ICTNET at Web Track 2010 Diversity Task

ICTNET at Web Track 2010 Diversity Task ICTNET at Web Track 2010 Diversity Task Yuanhai Xue 1,2, Zeying Peng 1,2, Xiaoming Yu 1, Yue Liu 1, Hongbo Xu 1, Xueqi Cheng 1 1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing,

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

Feature-Guided K-Means Algorithm for Optimal Image Vector Quantizer Design

Feature-Guided K-Means Algorithm for Optimal Image Vector Quantizer Design Journal of Information Hiding and Multimedia Signal Processing c 2017 ISSN 2073-4212 Ubiquitous International Volume 8, Number 6, November 2017 Feature-Guided K-Means Algorithm for Optimal Image Vector

More information

An Apriori-like algorithm for Extracting Fuzzy Association Rules between Keyphrases in Text Documents

An Apriori-like algorithm for Extracting Fuzzy Association Rules between Keyphrases in Text Documents An Apriori-lie algorithm for Extracting Fuzzy Association Rules between Keyphrases in Text Documents Guy Danon Department of Information Systems Engineering Ben-Gurion University of the Negev Beer-Sheva

More information

A Proposed Model For Forecasting Stock Markets Based On Clustering Algorithm And Fuzzy Time Series

A Proposed Model For Forecasting Stock Markets Based On Clustering Algorithm And Fuzzy Time Series Journal of Multidisciplinary Engineering Science Studies (JMESS) A Proposed Model For Forecasting Stock Markets Based On Clustering Algorithm And Fuzzy Time Series Nghiem Van Tinh Thai Nguyen University

More information